ABSTRACT OF THE DISSERTATION

Processing and Transmitting Information,

Given  a  Pay-Off  Function 

by 

Henri Michel Pham-Huu-Tri
Doctor  of  Philosophy  in  Mathematics 
University  of  California,  Los  Angeles,  1968 
Professor  Jacob  Marschak,  chairman 

An information system is defined as a chain of information
services: encoding (processing), transmitting, decoding (deciding).
Each service is a transformer represented, in general, by a stochastic
matrix and a cost function. The inputs of "encoding" are the
pay-off-relevant events. Actions are the output of decoding; actions and
events determine the pay-off. The utility of the services to the user
is a function of the pay-off and of the different costs. Efficiently
choosing an information system is, by definition, choosing an information
system which maximizes the expected utility.

Communication engineers restricted themselves to information
systems with fixed transmitting (channel) and identically zero cost
functions. Moreover, they equated the user's utility function with
his pay-off function. They handled the problem in the following way:
1. choose first encoding with respect to the source of events
and the pay-off function only; 2. choose second encoding and
decoding with respect to transmitting only. Encoding is the
composition of first and second encoding. However, their approach
was inefficient: 1. they neglected the pay-off function in the
choice of second encoding and decoding; 2. they arbitrarily broke
the original problem into two independent, more accessible
problems.

We also restricted ourselves to information systems with fixed
transmitting and zero cost functions and users' utility functions
identical to their pay-off functions. But our approach is more
efficient because we treated the problem of choosing encoding and
decoding, given a source of events, a pay-off function and a
channel, as a whole. The bounds we obtained should, therefore, be
better, at least in all cases where the pay-off function has a wide
range of values. We did, however, treat the non-restricted problem
with certain properties of the source, the channel and the utility
function assumed.
SECTION 1


1.1.  Introduction 

The Economic Theory of Information is concerned with the
efficient choice of Information services. J. Marschak (Efficient
Choice of Information Services, 1968, Conference for Research on
Management Information Systems) distinguishes the following sequence
of services, in that order: Inquiring, Communicating and Deciding.
Communicating is itself a sequence of Encoding, Transmitting and
Decoding. Another component of the sequence, called Storing, which
can be intermediate between any two consecutive services, will be
disregarded in this work, together with Inquiring, which is the same
as assuming that they are both costless and perfect. Moreover,
Decoding and Deciding will be reduced, without loss of generality,
to a single operation: decoding into action. Our simplified chain
of services, or information system, will then consist of only three
links: Encoding, Transmitting, Decoding.

More precisely (see Diagram 1.1), there will be a source, S, of
events (or messages, since inquiring is assumed to be an identity
operation) generating the random variable e from a finite set E
with the distribution $P(\cdot)$. There will be discrete, memoryless
channels denoted $(X, P(y|x), Y)$ or simply $(P(y|x))$ with finite
input and output alphabets X and Y respectively. The Transmission
$T(\cdot) : X \to Y$ is a random function; the Encoding will be
denoted $\psi_1(\cdot) : E \to X$; the Decoding $\psi_2(\cdot) : Y \to A$, where A is
the finite set of feasible actions {a}.


Diagram  1.1 


One of the criteria that will be considered in the choice of
services is the benefit to the user, a function $\omega(\cdot,\cdot)$ of e and a,
called the pay-off function: $\omega(\cdot,\cdot) : E \times A \to$ Reals. The other criteria
are the costs of the different operations. If costs did not depend on
the chosen information system, the user would, by definition, prefer
the system yielding the highest expected Pay-off:

$$E_{\psi_1,T,\psi_2}\{\omega(e,a)\} = \sum_{e,a} P(e)\, \mathrm{Prob}_{\psi_1,T,\psi_2}(a|e)\, \omega(e,a).$$

The subscript is here to recall that the probability of action a,
given event e, is a function of Encoding, Transmitting, and Decoding.

Now if the costs are introduced: $k_{\psi_1}(e)$, cost of Encoding e;
$k_T(x)$, cost of Transmitting x; $k_{\psi_2}(y)$, cost of Decoding y, the
user would try to maximize the expected value of a certain function
$U(\omega(e,a), k_{\psi_1}(e), k_T(x), k_{\psi_2}(y))$, by definition his utility function.
Not much can be said about $U(\cdot,\cdot,\cdot,\cdot)$ besides the fact that it is
increasing in $\omega(e,a)$ and decreasing in $k_{\psi_1}(e)$, $k_T(x)$, $k_{\psi_2}(y)$.
Moreover, the costs themselves are not well known, especially the costs
of Encoding and Decoding, which depend upon their complexity. One has
to resort to arbitrary elementary (often linear) functions
to represent $U(\cdot)$, $k_{\psi_1}(\cdot)$, $k_T(\cdot)$, $k_{\psi_2}(\cdot)$ more or less realistically.

We are not ready at this point to approach the general problem
except for a special case: binary symmetric memoryless source,
binary symmetric pay-off function, binary symmetric memoryless channel,
$U(\cdot)$ linear in $k_{\psi_1}(\cdot)$, $k_T(\cdot)$, $k_{\psi_2}(\cdot)$. In the rest of this work
the transmission system costs will be assumed constant and the choice
will be restricted to Transmission Systems with a fixed Channel. In
other words, attention will be devoted to the following
preliminary problem: find Encoding and Decoding procedures that
maximize the expected value of the pay-off function $\omega(\cdot,\cdot)$. In doing
so, we will get some insight into the original problem and some
partial answers to it.
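For very small alphabets and single-letter codes, the preliminary problem can be illustrated by exhaustive search over deterministic encoders and decoders. The sketch below is only an illustration under stated assumptions, not the procedure developed in this work: the function names, the toy source, the binary symmetric channel with crossover 0.1 and the 0/1 pay-off matrix are all hypothetical.

```python
from itertools import product

def expected_payoff(P, channel, payoff, enc, dec):
    """Expected pay-off: sum_e P(e) sum_y p(y | enc[e]) * payoff[e][dec[y]]."""
    total = 0.0
    for e, pe in enumerate(P):
        x = enc[e]                       # channel input chosen for event e
        for y, pyx in enumerate(channel[x]):
            total += pe * pyx * payoff[e][dec[y]]
    return total

def best_code(P, channel, payoff):
    """Exhaustive search over deterministic encoders E -> X and decoders Y -> A."""
    n_e, n_x = len(P), len(channel)
    n_y, n_a = len(channel[0]), len(payoff[0])
    best = (float("-inf"), None, None)
    for enc in product(range(n_x), repeat=n_e):
        for dec in product(range(n_a), repeat=n_y):
            v = expected_payoff(P, channel, payoff, enc, dec)
            if v > best[0]:
                best = (v, enc, dec)
    return best

# Hypothetical toy data: two equiprobable events, a binary symmetric
# channel, and pay-off 1 when the action matches the event.
P = [0.5, 0.5]
bsc = [[0.9, 0.1], [0.1, 0.9]]
payoff = [[1.0, 0.0], [0.0, 1.0]]
value, enc, dec = best_code(P, bsc, payoff)
```

The search space grows as |X|^|E| times |A|^|Y|, so this is usable only as a check on tiny examples.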


1.2.  Pure  Communication  of  Information  vs.  Communication 
of  Information,  Given  a  Pay-off  Function 

What is usually called Information Theory is essentially a
theory of pure communication. It was principally started by
C. E. Shannon in 1948 in his "A Mathematical Theory of Communication".

Let e be a random variable generated by a source S, taking
on a finite number of values 1, ..., g, ..., G with probabilities
P(1), ..., P(G). The uncertainty associated with e was quite
arbitrarily defined to be the quantity:

$$H(e) = H(P(1), \ldots, P(G)) = -\sum_{g=1}^{G} P(g) \log P(g)$$

where $-\log P(g)$ was interpreted as the uncertainty associated
with the event {e = g} or the uncertainty removed (or information
conveyed) by revealing that e has taken on the value g. H(e) is
also called Entropy or Information rate of S.

This  measure  depends  only  on  the  probability  distribution  of  the 
messages.  In  particular,  two  messages  with  the  same  probability  have 
their  information  characterized  by  the  same  number  although  they  are 
not  necessarily  equally  valuable  to  the  user,  for  he  evaluates  the 
economic  value  of  a  message  by  the  maximum  profit  he  can  make  by 
using  it.  The  value  of  a  Source  of  information,  as  far  as  the  user  is 
concerned,  is  measured  by  the  maximum  expected  Pay-off  it  can  bring 
him. 
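The contrast between the two measures can be made concrete with a small sketch. It assumes, for illustration only, that the user observes the event perfectly and then takes the best action; the two pay-off matrices and the function names are hypothetical.

```python
import math

def entropy(P):
    """Entropy in nats: -sum_g P(g) log P(g), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in P if p > 0)

def source_value(P, payoff):
    """Maximum expected pay-off when each event is observed perfectly."""
    return sum(p * max(row) for p, row in zip(P, payoff))

# Same distribution, hence same entropy, but very different values to the user.
P = [0.5, 0.5]
payoff_small = [[1.0, 0.0], [0.0, 1.0]]
payoff_large = [[100.0, 0.0], [0.0, 1.0]]

h = entropy(P)
v_small = source_value(P, payoff_small)
v_large = source_value(P, payoff_large)
```

Both sources have entropy log 2, yet the second is worth far more to the user because one of its events carries a much larger pay-off.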

Shannon's further analysis of communication systems relies
greatly on his measure of Information. In his 1948 model (Diagram 1.2)
a randomly produced message generated by a Source is encoded into a
signal belonging to a specified set, called vocabulary. The encoded
message is transmitted through a noisy channel, whose output is
decoded. The objective is to select a vocabulary such that the
probability of correctly identifying the input signal is as large as
possible.

Diagram 1.2

Diagrams 1.1 and 1.2 are identical, but our objective is
somewhat different: make the expected Pay-off as large as possible.
However, the two objectives are not completely irreconcilable, as shown
first by Shannon himself in his 1959 paper, "Coding theorems for a
discrete source with a fidelity criterion". Besides, it is intuitively
obvious that there should be some correlation between the probability
of correct transmission and the optimal expected pay-off.

In his 1959 model, Shannon added a new component to his
Communication System between the Source and the Encoder, and also a
distortion function d(e,a). d(e,a) is the "cost" of taking action
a when the message is e. In Economic terminology it is the loss
from not taking an optimal action.


Diagram  1.3 
This new operator ($\psi_0(\cdot) : E \to A$) mapped the messages e into
a specified set of actions in such a way as to decrease the rate of
information to be transmitted to a level acceptable to the channel,
but resulting in some loss in pay-off. The actions were then
transmitted with as small a probability of error as possible.

Later  on,  several  authors,  including  Yudkin,  Goblick,  and 
Jelinek,  improved  the  source-encoding  procedure. 

Let  us  point  out  that: 

I. It is intuitively clear that their approach is inefficient
because:

1) Double encoding ($\psi_0(\cdot)$, $\psi_1(\cdot)$) is not justifiable,
although more accessible to mathematical study. Moreover, $\psi_0(\cdot)$
maps events directly onto actions. Thus, if an action maximizes the
pay-off, given two different events, these two events will be encoded
in the same message. This message will specify that particular action.
Yet an error in transmitting that message will result in specifying
a non-optimal action and thus may cause a much greater loss in the
case of one event than in the case of the other. Two events e and
e', equivalent with respect to optimal action, are, in general, not
equivalent with respect to the values of d(e,a), d(e',a) for varying
a. Thus $\psi_0(\cdot)$ would replace the set of "pay-off relevant events"
by a generally coarser set of "action relevant events" and this
diminishes the maximum expected pay-off (Reference i).

2) $\psi_1(\cdot)$ and $\psi_2(\cdot)$ are chosen with no account taken of
the differences in losses due to having one rather than another
communication error.

3) They handled the communication problem in the following
way: on the one hand, choose $\psi_0(\cdot)$ for the given source and loss
function (in Shannon's terminology, the distortion function) $d(\cdot,\cdot)$
only; on the other hand, choose $\psi_1(\cdot)$ and $\psi_2(\cdot)$, given the channel
only. However, breaking the communication problem into these two
independent problems is not efficient in most cases. $\psi_1(\cdot)$ and $\psi_2(\cdot)$
should be chosen simultaneously, given S, $T(\cdot)$ and $d(\cdot,\cdot)$.

II.  No  explicit  solutions  are  ever  displayed,  but  only  their 
existence  is  proved. 

III.  Only  code  words  of  fixed  length  are  considered,  although 
simple  examples  show  that  variable  length  encoders  are  often  more 
efficient. 

IV.  The  usual  analysis  is  confined  to  long  blocks  of  events  and 
long  code  words  which  indeed  tend  to  yield  perfect  results. 

This last point is economically quite crucial. In practice, it
is often impossible, or would result in great losses, to wait for a
large number of messages to pile up before one starts to communicate
them. In this sense, the information they carry might become
obsolete from the user's point of view. A great deal of work is
left to be done in this area.

We have made no progress with respect to II, III and IV. But
we have given special emphasis to non-asymptotic results, so that a
user who can afford to wait for up to N messages to accumulate might
have some indication about how well he can do and about how to do it.

We have focused our attention on I. In an effort to tie the
given Source, Channel and Pay-off Function together, we have
considered a deterministic correspondence (to be optimized) between the
channel input alphabet and the set of feasible actions. We are, therefore,
able to ascribe a value (or loss) to each error, and thus to estimate
the loss due to "single-step (source and channel) encoding" and also to
increase the precision of the estimation of the loss due to transmission.
In our procedure, both encoding and transmission aim to maximize the
expected pay-off. In Shannon-Pilc-Jelinek's, only the encoding aimed
to maximize the expected pay-off, while the transmission aimed to
maximize the probability of correct transmission. Our upper bound
on the loss due to communication is therefore better in all cases
where the loss function has a wide range of values.

1.3.  Summary 

In Section 2.1, a brief survey of the main concepts of Information
Theory is made with an emphasis on a notion of special interest to us,
the Rate-Distortion Function, introduced by Shannon in 1959, which we
will call Rate-Loss Function. In Section 2.2, we introduce our
notation and definitions, set the relationship between Pay-off and Loss
Functions, respectively $\omega(e,a)$ and d(e,a), and describe our scheme.
A Processing (Source Encoding) Loss Function $\delta_1(e,x)$ and a
Transmission Loss Function $\delta_2(x,x')$ are derived in such a way that

$$E_{e,a}\{d(e,a) \mid \psi_1(\cdot), T(\cdot), \psi_2(\cdot)\} \le E_{e,x}\{\delta_1(e,x) \mid \psi_1(\cdot)\} + E_{x,x'}\{\delta_2(x,x') \mid T(\cdot), \psi_2(\cdot)\}.$$

It is convenient to consider the loss matrices associated with $d(\cdot,\cdot)$,
$\delta_1(\cdot,\cdot)$ and $\delta_2(\cdot,\cdot)$. $[d(e,a)]$ (respectively $[\delta_1(e,x)]$, $[\delta_2(x,x')]$)
is the matrix with d(e,a) as entry in the e-th row and the a-th
column.

In Section 3, we give a lower bound to the average loss one should
expect with a channel of capacity C. Theorem I states that: for a
constant, memoryless source S with a finite loss function $d(\cdot,\cdot)$,
and a discrete, memoryless channel of capacity C, there exists no
encoding and decoding procedure that yields an expected loss smaller
than $R^{-1}(C)$, where $R^{-1}(\cdot)$ is the inverse function of $R(\cdot)$, the
Rate-Loss Function, defined by Shannon. Corollary I states that:
for a constant, memoryless source S with a finite loss function
$d(\cdot,\cdot)$, there exists no source encoding procedure that yields a
processing loss less than $R^{-1}(H(x))$ if $H(x)$ is the entropy of the
channel input letters in the vocabulary.

Section 4 is devoted to Encoding. In 4.1 the source encoding (or
Processing) procedure originated by Shannon and improved by Yudkin and
Jelinek is described in detail and it is shown that there are encoding
functions $\psi_1(\cdot)$ which yield an average Processing loss as close as
we please to the lower bound of Corollary I. Theorem II is a converse
of Corollary I. In 4.2 a transmission loss function
$\delta_2(\cdot,\cdot) : X \times X \to$ Reals, derived from the encoding Procedure,
overbounds the loss when channel input x is sent over the channel and
recovered as x'.

In Section 5 we prove a transmission loss theorem. Theorem III
says roughly that it is possible to select vocabularies
$U = \{u_1, \ldots, u_m, \ldots, u_M\}$, $M = e^{n\psi}$, of code words of length n,
$u_m \in X^n$, and decoding functions that yield, on the
average, a transmission loss as low as desired, provided $\psi < C$,
the channel capacity.

In Section 6, Processing and Transmission are linked together to
give Theorems IV and IV'. Theorems II, III, IV and IV' give in fact
upper bounds to the various expected losses. Theorem IV states, in
short, that there are codes $(\psi_1(\cdot), \psi_2(\cdot))$ that yield, on the
average, a loss, due to communication, as close as desired to the
lower bound in Theorem I, if Source, loss function and Channel are
matched in a certain way. Theorem IV' is a variant of Theorem IV
for limited length message blocks.

In  Section  7  we  treat  the  general  problem  stated  in  the 
introduction  and  give  a  tentative  approach  to  the  special  case  where 
the  following  additional  assumptions  obtain:  binary,  uniform  source, 
binary  symmetric  loss  function,  binary  symmetric  channel,  linear 
utility  and  cost  functions. 
SECTION 2


2.1.  Basic  Concepts  of  Information  Theory 

2.1.1. A discrete channel, denoted by $(X, p(y|x), Y)$ or
$(p(y|x))$, consists of two finite sets, X and Y, and a non-negative
function p(y|x), defined for all pairs (x,y), $x \in X$, $y \in Y$, such
that $\sum_y p(y|x) = 1$ for all x. X and Y are called the input and
output sets of the channel and p(y|x) is the conditional probability
of receiving y when x is transmitted.

It is standard practice to consider the transmission of a
sequence of symbols, each symbol belonging to the input set X. For
any positive integer n and any set, for example X, we denote by
$X^n$ the set of n-tuples $x^n = (x_1, \ldots, x_n)$ with each $x_k \in X$. If a
sequence $x^n = (x_1, \ldots, x_n)$ is applied at the input of the channel,
then a sequence $y^n = (y_1, \ldots, y_n) \in Y^n$ is received at the output with
a conditional probability $p(y_1, \ldots, y_n \mid x_1, \ldots, x_n)$ which has yet to
be specified for all $x_1, \ldots, x_n$ and all n. We will restrict our
attention to discrete channels without memory. For such channels,
successive operations are independent.

A discrete channel $(X, p(y|x), Y)$ is said to be memoryless if

$$p(y^n \mid x^n) = \prod_{k=1}^{n} p(y_k \mid x_k)$$

for all $y^n \in Y^n$, all $x^n \in X^n$ and all $n \in \{1, 2, \ldots\}$.

Thus  a  discrete,  memoryless  channel  (X,p(y|x),Y)  is 

characterized  by  a  matrix  with  row  set  X  and  column  set  Y,  whose 

entries  are  p(y|x).  This  matrix  is  called  the  channel  matrix.  In 

this  work  a  channel  will  always  be  a  discrete  memoryless  channel. 
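As a small illustration of the product rule that defines a memoryless channel, the following sketch evaluates the probability of an output sequence given an input sequence from the channel matrix. The function name and the binary symmetric channel with crossover 0.1 are assumptions made for the example.

```python
def sequence_probability(channel, x_seq, y_seq):
    """p(y^n | x^n) = prod_k p(y_k | x_k) for a discrete memoryless channel.

    channel[x][y] is the channel matrix entry p(y | x).
    """
    prob = 1.0
    for x, y in zip(x_seq, y_seq):
        prob *= channel[x][y]
    return prob

# Hypothetical binary symmetric channel with crossover probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
p = sequence_probability(bsc, [0, 1, 1], [0, 1, 0])  # 0.9 * 0.9 * 0.1
```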

Let M and n be positive integers, and $0 < \lambda < 1$. A code
of length M, word length n, and probability of error $\le \lambda$, denoted
$(n, M, \lambda)$, for a discrete memoryless channel $(X, p(y|x), Y)$ consists
of a sequence of M distinct elements of $X^n$, $\{u_1, \ldots, u_M\}$, and a
sequence of M disjoint subsets of $Y^n$, $\{D_1, \ldots, D_M\}$, such that

$$P(D_m \mid u_m) = \sum_{y^n \in D_m} p(y^n \mid u_m) \ge 1 - \lambda \quad \text{for } m = 1, \ldots, M,$$

where, writing $u_{m,k}$ for the k-th letter of $u_m$,

$$p(y^n \mid u_m) = \prod_{k=1}^{n} p(y_k \mid u_{m,k}).$$

$\{u_1, \ldots, u_M\}$ is called a vocabulary of input messages or codewords
and $D_m$ is called a decoding set for $u_m$.

Practically one uses a code as follows. A message $u_m$ is
selected arbitrarily and transmitted over the channel. The letter
sequence $y^n$ is received with probability $p(y^n \mid u_m)$. If $y^n \in D_{m'}$,
the receiver concludes that $u_{m'}$ was sent. The probability that any
message $u_m$ will be transmitted so as to be decoded incorrectly is
$\le \lambda$.

A real number $\psi > 0$ is called an attainable transmission rate
for a channel p(y|x) if there exists a sequence of codes
$(n, M_n, \lambda_n)$ for p(y|x) with $M_n \ge e^{n\psi}$ and $\lambda_n \to 0$. The transmission
capacity of a discrete memoryless channel is defined to be the
supremum of the set of its attainable rates. We may give the following
interpretation to the transmission capacity $\psi^*$. If $0 < \psi < \psi^*$,
then one can transmit $\psi$ e-ary symbols per transmission
period over the channel with an arbitrarily small probability of error
by making the word length n large enough.
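The trade-off between word length, error probability and rate can be seen on the simplest possible code. The sketch below assumes a binary symmetric channel and the two-word repetition code with majority decoding, neither of which is taken from the text. It shows the error probability falling as n grows while the rate (1/n) log M also falls to zero, so it illustrates the definition of an (n, M, λ) code but not an attainable positive rate.

```python
from math import comb, log

def repetition_code_error(n, p):
    """Error probability of majority decoding for the length-n repetition
    code (n odd, M = 2 words) over a binary symmetric channel with
    crossover probability p: more than half of the n letters are flipped."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

for n in (1, 3, 5):
    lam = repetition_code_error(n, 0.1)   # smallest admissible lambda for this code
    rate = log(2) / n                      # (1/n) log M in nats, with M = 2
    print(n, lam, rate)
```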

If $(X, p(y|x), Y)$ is a discrete memoryless channel and q(x)
is a given probability distribution on X, then we let
$p(x,y) = p(y|x)\,q(x)$ and $r(y) = \sum_x p(x,y)$. We define

$$H(y) = -\sum_y r(y) \log r(y)$$

$$H(y|x) = -\sum_x \sum_y q(x)\, p(y|x) \log p(y|x)$$

where all logarithms are base e. Let

$$I(q) = H(y) - H(y|x) = H(x) - H(x|y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{q(x)\, r(y)}.$$

In information theory the quantity I(q), which depends on the
input distribution $q(\cdot)$, is interpreted as the average amount of
information, per transmission, received through the channel. The
maximum amount of information received through the channel is called
channel capacity. It is defined as the maximum over $q(\cdot)$ of I(q):

$$C = \max_q I(q) = \max_q \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{q(x)\, r(y)},$$

where the max is taken over all distributions $q(\cdot)$ on X.

The fundamental Theorem of Information Theory, which was first
proved by Shannon, states that, for any discrete memoryless channel,

$$\psi^* = C.$$
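The definition of I(q) and of the capacity as its maximum can be checked numerically on a small channel. The following sketch is illustrative only: it assumes a binary symmetric channel with crossover 0.1 and replaces the maximization by a crude grid search over input distributions; the function names are hypothetical.

```python
import math

def mutual_information(q, channel):
    """I(q) = sum_{x,y} p(x,y) log[ p(x,y) / (q(x) r(y)) ], in nats."""
    n_y = len(channel[0])
    # Output distribution r(y) = sum_x q(x) p(y|x).
    r = [sum(q[x] * channel[x][y] for x in range(len(q))) for y in range(n_y)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(n_y):
            pxy = qx * channel[x][y]
            if pxy > 0:                      # 0 log 0 is taken as 0
                total += pxy * math.log(pxy / (qx * r[y]))
    return total

def binary_entropy(p):
    """Entropy in nats of the distribution (p, 1 - p)."""
    return -sum(t * math.log(t) for t in (p, 1 - p) if t > 0)

bsc = [[0.9, 0.1], [0.1, 0.9]]
# Crude grid search over input distributions (q0, 1 - q0).
capacity = max(mutual_information([k / 1000, 1 - k / 1000], bsc)
               for k in range(1001))
```

For this symmetric channel the grid maximum is reached at the uniform input and agrees with the standard closed form log 2 minus the binary entropy of the crossover probability.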


2.1.2. Pay-off function and loss function. The pay-off
function $\omega(\cdot,\cdot) : E \times A \to$ Reals gives the benefit associated with
event e and action a. For any event e, there exists at least one
optimal action a(e) such that:

$$\omega(e, a(e)) \ge \omega(e, a) \quad \forall a.$$

The loss function associated with $\omega(\cdot,\cdot)$ is defined on the
same domain by the relation

$$d(e,a) = \omega(e, a(e)) - \omega(e, a).$$

This function is what is called the regret function in Decision
Theory. We used the letter d because it plays exactly the same role,
as far as processing and transmission of information go, as Shannon's
"single letter distortion measure". We want to communicate information
so as to maximize the expected pay-off. It is actually the same to
communicate information so as to minimize the expected loss, or
distortion.
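A short sketch of the passage from pay-off to loss, using a hypothetical 2 x 3 pay-off matrix (rows are events, columns are actions). It also checks the identity behind the last remark: the expected pay-off equals the expected pay-off of the optimal actions minus the expected loss, so maximizing one is the same as minimizing the other.

```python
def loss_matrix(payoff):
    """d(e,a) = max_a' payoff[e][a'] - payoff[e][a] (the regret)."""
    return [[max(row) - v for v in row] for row in payoff]

def expectation(P, w, f):
    """sum_{e,a} P(e) w(a|e) f(e,a) for a decision rule w[e][a]."""
    return sum(P[e] * w[e][a] * f[e][a]
               for e in range(len(P)) for a in range(len(f[0])))

# Hypothetical data: pay-offs, event probabilities and a randomized rule.
payoff = [[5.0, 2.0, 0.0],
          [1.0, 4.0, 4.0]]
P = [0.25, 0.75]
w = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]

d = loss_matrix(payoff)
best = sum(p * max(row) for p, row in zip(P, payoff))  # pay-off of optimal actions
```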


2.1.3.  The  Rate-Loss  Function  (Rate-Distortion  Function). 

This notion, first introduced by Shannon in his 1959 paper,
"Coding theorems for a discrete source with a fidelity criterion",
would appear to reconcile the two problems of communicating information
accurately and communicating it efficiently, given a Pay-off Function.

We will define this function formally. Its interpretation will
appear immediately and justify why, intuitively, it had to be
considered.

Let E = {1, ..., g, ..., G} be the set of events (or messages)
and A = {1, ..., k, ..., H} be the set of actions. Let (E, w(a|e), A)
be an arbitrary channel with input alphabet E and output alphabet
A. Let d(e,a) be the loss function and $P(\cdot)$ the probability
distribution on the messages generated by the source.

Consider

(1) $d(w(\cdot|\cdot)) = E_{e,a}\{d(e,a)\} = \sum_{e,a} P(e)\, w(a|e)\, d(e,a)$

(2) $I(w(\cdot|\cdot)) = \sum_{e,a} P(e)\, w(a|e) \log \dfrac{w(a|e)}{\sum_{e'} P(e')\, w(a|e')}$

By definition:

$$R(D) = \inf_{w(\cdot|\cdot)} I(w(\cdot|\cdot))$$

with the constraint

$$d(w(\cdot|\cdot)) \le D.$$

Note that $I(\cdot)$ is a continuous function of $w(\cdot|\cdot)$ and that
the domain of $w(\cdot|\cdot)$ is closed and bounded. The inf is in fact a
minimum when it exists. Moreover, R(D) is decreasing in D since
as D increases the domain of minimization increases. One shows
quite easily that R(D) is convex downward and that the constraint
$d(w(\cdot|\cdot)) \le D$ is equivalent to $d(w(\cdot|\cdot)) = D$.

$d(w(\cdot|\cdot))$ is a measure of the average loss; $I(w(\cdot|\cdot))$ is
the average rate of information through (E, w(a|e), A). This last
quantity is proportional to the effort we must make to transmit the
messages. We would like to make both of these quantities as small as
possible, which of course is not feasible. So, given the source,
$P(\cdot)$ and $d(\cdot,\cdot)$, it is important to know what is the smallest rate
of information consistent with the maintenance of a loss no greater
than some specified level, or equivalently, what is the smallest loss
we can achieve if the rate is fixed.
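The two quantities being traded off can be evaluated directly for any given channel w(a|e). The sketch below does so for a binary uniform source with the 0/1 loss and a symmetric test channel with error level D = 0.2; these choices and the function names are assumptions for illustration. Evaluating one test channel gives in general only an upper bound on the smallest rate at that loss level; for this particular source and loss, the symmetric test channel is known from the standard theory (not from the text above) to attain the minimum, log 2 minus the binary entropy of D.

```python
import math

def average_loss(P, w, d):
    """d(w) = sum_{e,a} P(e) w(a|e) d(e,a)."""
    return sum(P[e] * w[e][a] * d[e][a]
               for e in range(len(P)) for a in range(len(d[0])))

def information_rate(P, w):
    """I(w) = sum_{e,a} P(e) w(a|e) log[ w(a|e) / sum_e' P(e') w(a|e') ], in nats."""
    n_a = len(w[0])
    marg = [sum(P[e] * w[e][a] for e in range(len(P))) for a in range(n_a)]
    return sum(P[e] * w[e][a] * math.log(w[e][a] / marg[a])
               for e in range(len(P)) for a in range(n_a)
               if P[e] * w[e][a] > 0)

# Hypothetical example: binary uniform source, loss 0 if e = a and 1 otherwise.
P = [0.5, 0.5]
d = [[0.0, 1.0], [1.0, 0.0]]
D = 0.2
w = [[1 - D, D], [D, 1 - D]]   # symmetric test channel with error level D

loss = average_loss(P, w, d)
rate = information_rate(P, w)
```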

The answers to these questions are given by R(D), the so-called
Rate-Distortion function, or Information rate of the Source for a loss
level D. (This has to be proved because the minimum was taken over
a very restricted class of information systems.) Shannon's coding
theorem states that, with some mild restriction on $P(\cdot)$ and $d(\cdot,\cdot)$,
R(D) is the minimum achievable rate of information consistent with
$E_{e,a}\{d(e,a)\} \le D$. Or, for any $\varepsilon > 0$, there exist codes with
$E_{e,a}\{d(e,a)\} \le D$ and Rate $\le R(D) + \varepsilon$. Conversely, there exists no
code with average loss D and rate less than R(D).

Typically, R(D) is found to have the following general shape:

(Figure: general shape of R(D) as a function of D.)

where: $D_{\min} = \sum_e P(e) \min_a d(e,a) = 0$ because our distortion function
has the property that for any e there exists an a such that d(e,a) = 0.
This point is achieved with a deterministic channel with $w(a(e)|e) = 1$
and $w(a|e) = 0$ for all $a \ne a(e)$. The corresponding value R(0) of
R(D) is the entropy H(e) of the source.

$D_{\max} = \min_a \sum_e P(e)\, d(e,a)$ is the minimum achievable average
loss with no information. Here the w(a|e) matrix has a column of
ones, all the other entries being zero. The capacity of such a
channel is null because it has identical rows (see Ash, for example).

For a binary, uniform, memoryless source, with the loss function:

$$d(e,a) = 1 - \Delta(e,a) = \begin{cases} 0 & \text{if } e = a \\ 1 & \text{if } e \ne a. \end{cases}$$


2.2.  Notation  and  Definitions 


2.2.1. The source S produces a sequence $\{e_k\}_{k=1}^{\infty}$ of
messages (or events) at a fixed rate of 1 message per second, each
$e_k$ being taken at random from a finite set E = {1, ..., g, ..., G}.
The process $\{e_k\}_{k=1}^{\infty}$ is a sequence of independent, identically
distributed random variables: $\mathrm{Prob}\{e_k = g\} = P(e = g) = P(g)$ for all
k = 1, 2, ... .


2.2.2. The channel K is discrete and memoryless, with input
alphabet X = {1, ..., i, ..., I}, output alphabet Y = {1, ..., j, ..., J}, and
Prob(y = j | x = i) = p(y|x). The channel capacity per use is C,
and it can be used at most once every second.


2.2.3. The actions form a sequence $\{a_k\}_{k=1}^{\infty}$; each $a_k$ is
chosen within a finite set A = {1, ..., k, ..., H}.
2.2.4. Blocks of n events

$$e^n = e_1 \cdots e_k \cdots e_n \in E^n = \underbrace{E \times E \times \cdots \times E}_{n}$$

are encoded into blocks of n channel input letters:

$$x^n = x_1 \cdots x_k \cdots x_n \in X^n.$$

Blocks of n output letters are recovered:

$$y^n = y_1 \cdots y_k \cdots y_n \in Y^n,$$

which are decoded into blocks of n actions:

$$a^n = a_1 \cdots a_k \cdots a_n \in A^n.$$

REMARK:  As  was  said  in  the  introduction,  it  is  not  efficient 
to  restrict  ourselves  to  vocabularies  where  all  words  have  equal 
length . 


2.2.5. A code of length M consists of: a vocabulary U of
M channel words of length n:

$$U = \{u_1, \ldots, u_m, \ldots, u_M\}, \quad u_m \in X^n,$$

and of two functions, respectively called Encoding and Decoding
functions:

$$\psi_1 : E^n \to U \subset X^n, \qquad \psi_2 : Y^n \to A^n.$$


2.2.6. The rate of a code is defined to be
$\psi = \frac{1}{n} \log M$, where the log is of base e. We will sometimes use
$\log_2$ to express final results because a bit of information is more
readily interpretable than a nat. It suffices though to remember
that

$$1 \text{ nat} = \log_2 e \text{ bits} = \frac{1}{\log_e 2} \text{ bits} \approx 1.443 \text{ bits}.$$
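A minimal sketch of the rate definition and of the conversion between nats and bits; the function names and the example code size (16 words of length 8) are hypothetical.

```python
import math

def code_rate_nats(M, n):
    """Rate (1/n) log M of a code with M words of length n, in nats."""
    return math.log(M) / n

def nats_to_bits(x):
    """Convert a quantity in nats to bits: multiply by log2(e)."""
    return x * math.log2(math.e)

rate = code_rate_nats(16, 8)     # 16 words of length 8
rate_bits = nats_to_bits(rate)   # the same rate, in bits per channel letter
```

With 16 words of length 8 the rate is (1/8) log2 16 = 0.5 bit per channel letter.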


2.2.7. The loss measure. We recall that the loss when event
e has occurred and action a has been taken is defined to be:

$$d(e,a) = \omega(e, a(e)) - \omega(e, a), \qquad \omega(e, a(e)) \ge \omega(e, a) \quad \forall a,$$

where $\omega(e,a)$ is a finite Pay-off Function. d(e,a) is a non-negative
function of e and a and, for e fixed, it assumes the value
zero at least once.

By definition:

$$d(e^n, a^n) = \frac{1}{n} \sum_{k=1}^{n} d(e_k, a_k).$$

This definition implies that time does not enter this problem. It
intervenes only through the message and decision rates.

The treatment of time would introduce more parameters such
as: encoding time, transmission time, decoding time, discount
factor, ....

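The overall loss for the system of information $(\psi_1(\cdot), T(\cdot), \psi_2(\cdot))_n$
is then

$$\bar{d} = \sum_{e^n, a^n} P(e^n)\, \mathrm{Prob}_{\psi_1,T,\psi_2}(a^n \mid e^n)\, d(e^n, a^n).$$

$\mathrm{Prob}_{\psi_1,T,\psi_2}(a^n \mid e^n)$ is the result of the composition of
$\psi_1(\cdot)$, $T(\cdot)$ and $\psi_2(\cdot)$ in that order.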

2.2.8. Processing and Transmission Loss Functions. We have
to cope with a major difficulty in connection with $\psi_1(\cdot)$: a channel
input letter cannot be loaded, on the average, with an amount of
information larger than or equal to the channel capacity, C, because
the information which was loaded will eventually be entirely lost
after transmission. If H(e) > C, one is forced to resort to what
transmission engineers call a noisy code, i.e., a code where the same
word may represent several messages. In doing so, we present the
channel with coarser information, that is, information of lesser value.
The information is coarsened to the extent necessary so that it can
be carried through the channel.

Practically, an efficient choice of $\psi_1(\cdot)$ and $\psi_2(\cdot)$ is
possible only if one has a measure of the loss (of information value)
due to $\psi_1(\cdot)$ and a measure of the loss due to $T(\cdot)$ and $\psi_2(\cdot)$
such that the total loss is given by processing ($\psi_1(\cdot)$) + transmitting
($T(\cdot) \circ \psi_2(\cdot)$) losses.

Unless there is a well-defined relationship between channel
input letters and actions, it seems difficult to derive these
measures from $d(\cdot,\cdot)$. For the sake of simplicity, we have assumed
I = H (i.e., the channel input alphabet and the action set are of
equal size), which leads us to consider all possible 1-1 correspondences
between X and A. Not making this assumption would bring
about more complex associations, but the increased difficulty, we
think, is not insurmountable.

Let $x_a$ momentarily denote the x associated with a particular
a:

DEFINITION. $\delta_1(e,x_a) = d(e,a)$ is called the Processing Loss
Measure (Function), with $\delta_1(e^n,x^n) = \frac{1}{n} \sum_{k=1}^{n} \delta_1(e_k,x_k)$.

The average Processing Loss, $\bar\delta_1$, equals

\[ \sum_{e^n} P(e^n)\, \delta_1(e^n,\psi_1(e^n)). \]

The transmission loss when $x^n$ is sent and $x'^n$ is received
is given by

\[ \ldots \]

However, the proof of our transmission loss bound required a single
letter measure (i.e., $\delta_2(x'^n,x^n) = \frac{1}{n} \sum_{k=1}^{n} \delta_2(x'_k,x_k)$). That brought
about the following definition.

DEFINITION.

\[ \delta_2(x,x') = \max \Bigl\{ \sum_e P(e)\chi_{xx'}(e)\,[\delta_1(e,x') - \delta_1(e,x)];\ \sum_e P(e)\chi_{x'x}(e)\,[\delta_1(e,x) - \delta_1(e,x')] \Bigr\}, \]

where $\chi_{xx'}(e)$ is the indicator function of $\{e : \delta_1(e,x') > \delta_1(e,x)\}$.

It is justified by the fact that

\[ \ldots\ P(e^n)\,[\delta_1(e^n,x'^n) - \delta_1(e^n,x^n)]. \]


2.2.9.  Transmission  Scheme. 


Processing Loss: $\bar\delta_1 = E_{e^n}\{\delta_1(e^n,\psi_1(e^n))\}$

Transmission Loss: $\bar\delta_2 = E_{x^n,y^n}\{\delta_2(x^n,\psi_2(y^n))\}$

Total loss $\le E_{e^n}\{\delta_1(e^n,\psi_1(e^n))\} + E_{x^n,y^n}\{\delta_2(x^n,\psi_2(y^n))\}$



SECTION  3 


The  lower  bound  on  Communication  Loss  for  a  given  channel  capacity 

The  theorem  we  are  about  to  state  is  due  to  Shannon  (1959). 

It  answers  the  question,  'What  is  the  smallest  average  loss  one  should 
expect  with  a  channel  of  capacity  C?' 

THEOREM I. For an independent memoryless source S with a
finite loss function $d(\cdot,\cdot)$, and a discrete memoryless channel of
capacity C, there exists no encoding and decoding function that
yields an average loss smaller than $R^{-1}(C)$, where $R^{-1}(\cdot)$ is the
inverse of the Rate-loss function $R(\cdot)$.

In other words, for any code the average loss satisfies
$D \ge R^{-1}(C)$. We will give a condensed version of Shannon's proof:

Suppose D is the average loss for a block code of length n. Then:

- $nC \ge I(x^n;y^n)$ by definition of capacity for a discrete
memoryless channel;

- $I(x^n;y^n) \ge I(e^n;a^n)$ by the data processing
theorem (Feinstein, § 3.3);

- $I(e^n;a^n) = H(e^n) - H(e^n \mid a^n) \ge \sum_{k=1}^{n} [H(e_k) - H(e_k \mid a_k)]$, because

\[ H(e^n \mid a^n) = H(e_1 \mid a_1 \cdots a_n) + H(e_2 \mid e_1, a_1 \cdots a_n) + \cdots + H(e_n \mid e_1 \cdots e_{n-1}, a_1 \cdots a_n) \le H(e_1 \mid a_1) + \cdots + H(e_n \mid a_n). \]

Now $\sum_{k=1}^{n} I(e_k;a_k) \ge nR(D)$ by definition of R(D). Hence

\[ C \ge R(D) \ \Rightarrow\ D \ge R^{-1}(C), \]

because $R(\cdot)$ is decreasing.

q.e.d.
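As a hedged numerical illustration of the bound $D \ge R^{-1}(C)$, the sketch below uses an example that is not in the text: a binary source with P(e = 1) = p and a loss equal to 1 when the action differs from the event and 0 otherwise. For that special case the rate-loss function is known to be R(D) = h(p) - h(D) for D below min(p, 1 - p), with h the binary entropy. The function names are hypothetical, and the inverse is found by bisection, using the fact that R is decreasing.

```python
import math


def binary_entropy(x):
    """Binary entropy in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def rate_loss(D, p):
    """R(D) in nats for a binary source with a 0/1 loss (assumed example)."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)


def inverse_rate_loss(C, p, tol=1e-12):
    """Smallest average loss compatible with capacity C (nats per letter)."""
    if C >= rate_loss(0.0, p):
        return 0.0
    lo, hi = 0.0, min(p, 1.0 - p)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if rate_loss(mid, p) > C:   # rate still too high: loss must grow
            lo = mid
        else:
            hi = mid
    return hi


capacity_nats = 0.5 * math.log(2.0)      # half a bit per letter
D_min = inverse_rate_loss(capacity_nats, 0.5)
print(round(D_min, 3))
```

For an equiprobable binary source and a channel of half a bit per letter, no code can bring the average loss below roughly 0.11.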


COROLLARY I. For an independent memoryless source S with a
finite loss function $d(\cdot,\cdot)$, there exists no encoding function that
yields a processing loss smaller than $R^{-1}(H(x))$ if $H(x)$ is the
entropy of the channel input letters in the encoded messages.

Proof. This result clearly holds when the channel function
$T(\cdot)$ and $\psi_2(\cdot)$ are identity transformations. Then

\[ I(x^n;y^n) = H(x^n) \le nH(x), \]

and the same string of inequalities yields

\[ C \ge H(x) \ge R(D) \ \Rightarrow\ D \ge R^{-1}(H(x)). \]

q.e.d.


In the next section, we will prove that there exist $\psi_1(\cdot)$'s
such that

\[ \bar\delta_1 \le D, \qquad 0 \le D \le D_{\max}, \qquad H(x) \le R(D) + \varepsilon \quad \forall\, \varepsilon > 0. \]



SECTION  4 


4.1. The Processing Loss Theorem. An Upper Bound to Processing Loss

We are faced here with the problem of processing the messages
from the source (source encoding) in such a way as to decrease the
rate of information to be sent through the channel so that there will
be a least possible loss in information value for the user. The
relevant loss measure is $\delta_1(\cdot,\cdot) : E \times X \to$ Reals. We recall that a
1-1 correspondence was established between A and X and that
$\delta_1(e,x)$ was defined to be equal to d(e,a) for the associated
couple (x,a). It follows that the rate loss function for the source
S, with $\delta_1(\cdot,\cdot)$, is identical to $R(\cdot)$.

The  encoding  procedure  we  will  describe  is  due  to  Shannon  for 
its  basic  idea,  which  was  later  improved  by  Yudkin,  and  to  Jelinek 
for  its  final  version. 

The problem was: given a memoryless source, governed by a
distribution $P(\cdot)$ over the outputs $e \in E$, an alphabet
$X = \{1,\ldots,H\}$ and a loss measure $\delta_1 : E \times X \to$ Reals, let
$\psi_1(\cdot)$ denote an encoding function that maps sequences $e^n$ onto a
set $U = \{u_1,\ldots,u_M\}$ of M sequences of letters, $u_m = (x_{m1},\ldots,x_{mn})$,
called a vocabulary. What is the least obtainable value of the
average loss $E_{e^n}\{\delta_1(e^n,\psi_1(e^n))\}$ and what is the corresponding
$\psi_1(\cdot)$?

The only general answer to this question is obtained through a
"random coding argument", which involves "threshold type encoders"
proved to be efficient for large n's. In Shannon's approach, the
threshold was a constant. In Yudkin's, it is a function of the
sequence $e^n$ designed in such a way as to minimize the expected
processing loss. We may think of $\delta_1(e,x)$ as a distance between
e and x (although it does not necessarily have the properties of a
mathematical distance) and say that an optimal $\psi_1(\cdot)$ should map each
sequence $e^n$ onto its closest code word in the vocabulary.

We  will  now  reproduce  the  main  steps  of  Jelinek's  proof.  We 
will  need  these  in  Section  4.2. 

$\psi_1(\cdot)$ is defined in the following fashion, given U:

\[ \psi_1(e^n) = \begin{cases} u_m & \text{if } \delta_1(e^n,u_{m'}) > d_0(e^n) \text{ for } m' = 1,\ldots,m-1 \text{ and } \delta_1(e^n,u_m) \le d_0(e^n), \\ u_M & \text{otherwise.} \end{cases} \]

$d_0(e^n)$ is a function over $E^n$ whose exact form will be
determined later according to the statistical properties of the source
and the loss function.
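The threshold rule can be sketched as follows. This is a minimal illustration, assuming the vocabulary is an ordered list of tuples of letter indices, the loss matrix is a nested list indexed as `delta1[e][x]`, and the threshold is passed in as a function `d0`; the names `block_loss` and `threshold_encode` and the numbers are hypothetical.

```python
def block_loss(delta1, e_seq, x_seq):
    """Per-letter average of delta1 over a block, as in the definition of delta1(e^n, x^n)."""
    return sum(delta1[e][x] for e, x in zip(e_seq, x_seq)) / len(e_seq)


def threshold_encode(e_seq, vocabulary, delta1, d0):
    """Return the first code word whose loss does not exceed d0(e_seq).

    If no code word passes the threshold, the last word u_M is used.
    """
    threshold = d0(e_seq)
    for code_word in vocabulary:
        if block_loss(delta1, e_seq, code_word) <= threshold:
            return code_word
    return vocabulary[-1]


# Hypothetical example: binary letters, loss 1 when letter and event differ.
delta1 = [[0.0, 1.0], [1.0, 0.0]]
vocabulary = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
message = (0, 1, 1)

encoded = threshold_encode(message, vocabulary, delta1, lambda e_seq: 0.34)
fallback = threshold_encode(message, vocabulary, delta1, lambda e_seq: 0.0)
print(encoded, fallback)
```

In the first call the word (1, 1, 1) is the first one within the threshold; in the second call no word qualifies, so the last word of the vocabulary is returned.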

Let $d_{\max}$ be the largest element in the loss matrix $[\delta_1(e,x)]$
and $\Phi_U(e^n)$ the characteristic function defined on $E^n$ as follows:

\[ \Phi_U(e^n) = \begin{cases} 1 & \text{if } \delta_1(e^n,u_m) > d_0(e^n) \text{ for all } m = 1,\ldots,M, \\ 0 & \text{otherwise.} \end{cases} \]

The expected loss, given U and $d_0(\cdot)$, can be bounded:

\[ E_{e^n}\{\delta_1(e^n,\psi_1(e^n))\} = \bar\delta_1 \le \sum_{e^n} P(e^n)\, d_0(e^n)\,[1 - \Phi_U(e^n)] + d_{\max} \sum_{e^n} P(e^n)\, \Phi_U(e^n) \tag{4.1.1} \]



In order to estimate the bound, a random coding argument is
resorted to. Let us assume that the code words $u_m$ of U are selected
independently at random, with probability

\[ \mathrm{Prob}(u_m = x_m^n) = Q(x_m^n) = \prod_{k=1}^{n} q(x_{mk}), \qquad \mathrm{Prob}(U) = \prod_{m=1}^{M} Q(u_m), \]

where $q(\cdot)$ is an arbitrary distribution on X. Then:

\[ \bar\delta_1 \le \sum_{e^n} P(e^n)\, d_0(e^n) + d_{\max} \sum_{e^n} P(e^n)\, \bigl[ P_q\{x^n : \delta_1(e^n,x^n) > d_0(e^n)\} \bigr]^{M} \tag{4.1.2} \]

The M-th power appears because the M codewords are selected
independently.

Now, by the inequality $\log x \le x - 1$, or $x \le e^{x-1}$, the
above inequality can be written:

\[ \bar\delta_1 \le \sum_{e^n} P(e^n)\, d_0(e^n) + d_{\max} \sum_{e^n} P(e^n) \exp\bigl[ -M\, P_q\{x^n : \delta_1(e^n,x^n) \le d_0(e^n)\} \bigr] \tag{4.1.3} \]



Let $J \triangleq \{e^n : P_q\{\delta_1(e^n,x^n) \le d_0(e^n)\} < \frac{n}{M}\}$.

Then:

\[ \sum_{e^n} P(e^n)\, \bigl( P_q\{\delta_1(e^n,x^n) > d_0(e^n)\} \bigr)^{M} \le \sum_{e^n \in J} P(e^n) + \sum_{e^n \in J^c} P(e^n)\, e^{-n} \le e^{-n} + \sum_{e^n} P(e^n) \left[ \frac{n e^{-n\psi}}{P_q\{\delta_1(e^n,x^n) \le d_0(e^n)\}} \right]^{\sigma}, \quad \sigma \ge 0 \tag{4.1.4} \]

The last term is obtained by using the usual bound to the indicator
function of J and the definition of the rate of a code, $\psi \triangleq \frac{1}{n} \log M$.

A lower bound to the probability in the denominator is found
in Chapter 8 of Fano's "Transmission of Information".



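It can be shown that:

\[ P_q\{\delta_1(e^n,x^n) \le \tfrac{1}{n} \gamma_n'(\rho,e^n)\} \ge \exp\bigl[ \gamma_n(\rho,e^n) - \rho\gamma_n'(\rho,e^n) - \tfrac{GH}{2} \log 2\pi n - B(\rho,e^n) \bigr], \quad -\infty < \rho \le 0 \tag{4.1.5} \]

Where:

- $B(\rho,e^n)$ is a monotonically decreasing function of $\rho$, is
independent of n and is bounded for all $e^n$'s, as long as $|\rho|$, G, H
are finite.

- \[ \gamma_n(\rho,e^n) = \sum_{k=1}^{n} \gamma(\rho,e_k) = \sum_{k=1}^{n} \log \sum_{x} q(x)\, e^{\rho\delta_1(e_k,x)} \tag{4.1.6} \]

If we choose $d_0(e^n) = \frac{1}{n} \gamma_n'(\rho,e^n)$, (4.1.3) becomes

\[ \bar\delta_1 \le \sum_{e^n} P(e^n)\, \tfrac{1}{n} \gamma_n'(\rho,e^n) + d_{\max} \bigl[ e^{-n} + n^{GH/2} \beta(\rho)\, e^{-n(\sigma\psi - \mu(\sigma))} \bigr], \quad -\infty < \rho \le 0, \quad 0 \le \sigma \le 1 \tag{4.1.7} \]

Where:

- $\mu(\sigma) \triangleq \log \bigl[ \sum_e P(e)\, e^{-\sigma(\gamma(\rho,e) - \rho\gamma'(\rho,e))} \bigr]$

- $\beta(\rho) \triangleq (2\pi)^{GH/2} \exp\bigl[ \max_{e^n} B(\rho,e^n) \bigr]$

Finally, provided $\exists\, \sigma \in [0,1]$ such that $\mu'(\sigma) = \psi$:

\[ \bar\delta_1 \le \sum_{e} P(e)\, \gamma'(\rho,e) + d_{\max} \bigl[ e^{-n} + n^{GH/2} \beta(\rho)\, e^{-n(\sigma\mu'(\sigma) - \mu(\sigma))} \bigr], \quad -\infty < \rho \le 0 \tag{4.1.8} \]
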


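Let us define:

\[ w_\rho(x \mid e) \triangleq \frac{q(x)\, e^{\rho\delta_1(e,x)}}{\sum_{x' \in X} q(x')\, e^{\rho\delta_1(e,x')}} \]

so that

\[ \gamma'(\rho,e) = \sum_{x} w_\rho(x \mid e)\, \delta_1(e,x) \]

and

\[ \bar\delta_1 \le \sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) + d_{\max} \bigl[ e^{-n} + n^{GH/2} \beta(\rho)\, e^{-n(\sigma\mu'(\sigma) - \mu(\sigma))} \bigr]. \]

The second term tends to zero with n as long as $\sigma\mu'(\sigma) - \mu(\sigma) > 0$.
We want to minimize $\psi$ under the constraint:

\[ \sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) \le \Delta_1 \in [0,D_M]. \]

\[ \psi_{\min}(\Delta_1) = \min_{(\rho,q(\cdot)) \in \Lambda(\Delta_1)} \sum_{e,x} P(e)\, w_\rho(x \mid e) \log \frac{w_\rho(x \mid e)}{q(x)} \tag{4.1.9} \]

\[ \Lambda(\Delta_1) = \Bigl\{ (\rho,q(\cdot)) : \sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) \le \Delta_1,\ \rho \le 0 \Bigr\} \]

It turns out that $\psi_{\min}(\Delta_1) = R(\Delta_1)$ (4.1.10); for a complete
proof, I refer to Jelinek, Chap. 11, sections 11.5 and 11.4.

THEOREM II. Let S, a constant, memoryless source governed
by a distribution $P(\cdot)$ over the messages $e \in E$, a loss matrix
$[\delta_1(e,x)]$, $e \in E$, $x \in X$, and a number $\Delta_1 \in [0,D_M]$ be given. The
random family of encoding functions $\psi_1(\cdot)$ we have considered, of
output sequences $e^n$ onto a set $U = \{u_1,\ldots,u_M\}$ of $M = e^{n\psi}$
codewords $u_m = (x_{m1},\ldots,x_{mn})$, yields on the average a processing loss
$\bar\delta_1$ such that:

\[ E\{\delta_1(e^n,\psi_1(e^n))\} \le \Delta_1 + d_{\max} \bigl\{ e^{-n} + n^{GH/2} \beta(\rho)\, e^{-n\alpha(\psi,\Delta_1)} \bigr\} \tag{4.1.11} \]

Where

- $d_{\max}$ is the maximal entry in $[\delta_1(e,x)]$,

- $\sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) = \Delta_1$,

- $w_\rho(x \mid e) = \dfrac{q(x)\, e^{\rho\delta_1(e,x)}}{\sum_{x'} q(x')\, e^{\rho\delta_1(e,x')}}$,

and

- $\alpha(\psi,\Delta_1) > 0$ provided $\psi > R(\Delta_1)$.

COROLLARY II. There exists an encoding function of sequences
$e^n$ onto a vocabulary $U = \{u_1,\ldots,u_M\} \subset X^n$, $M = e^{n\psi}$, whose loss
is bounded by the same expression.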
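The two quantities that drive Theorem II, the average loss $\sum P(e) w_\rho(x|e) \delta_1(e,x)$ and the rate $\sum P(e) w_\rho(x|e) \log [w_\rho(x|e)/q(x)]$, are easy to evaluate numerically. The sketch below is only an illustration: the function names are hypothetical, and the example (equiprobable binary source, uniform q, loss 1 when letter and event differ) is an assumption chosen because its rate-loss function is known in closed form.

```python
import math


def tilted_distribution(q, delta1_row, rho):
    """w_rho(x|e): q(x) exp(rho * delta1(e,x)), normalized over x."""
    weights = [q_x * math.exp(rho * loss) for q_x, loss in zip(q, delta1_row)]
    total = sum(weights)
    return [w / total for w in weights]


def loss_and_rate(P, q, delta1, rho):
    """Average loss and rate (in nats) induced by the pair (rho, q)."""
    avg_loss = 0.0
    rate = 0.0
    for e, p_e in enumerate(P):
        w = tilted_distribution(q, delta1[e], rho)
        for x, w_x in enumerate(w):
            avg_loss += p_e * w_x * delta1[e][x]
            if w_x > 0.0:
                rate += p_e * w_x * math.log(w_x / q[x])
    return avg_loss, rate


# Assumed example: symmetric binary source with a 0/1 loss matrix.
P = [0.5, 0.5]
q = [0.5, 0.5]
delta1 = [[0.0, 1.0], [1.0, 0.0]]
target_loss = 0.11
rho = math.log(target_loss / (1.0 - target_loss))   # rho < 0
avg_loss, rate = loss_and_rate(P, q, delta1, rho)
print(round(avg_loss, 4), round(rate, 4))
```

For this symmetric case the computed rate equals log 2 minus the binary entropy of the target loss, which is the known rate-loss value for that example; at rho = 0 the rate drops to zero and the loss rises to 0.5.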


4.2. Derivation of the Communication Loss Measure (Function) $\delta_2(x,x')$

Let us come back a few steps from the point we have reached.
Let $q(\cdot)$ and $\rho \le 0$ be arbitrary. Let $\psi$ be equal to $\mu'(\sigma)$,
$0 \le \sigma \le 1$.

Let U be a vocabulary of $e^{n\psi}$ codewords
($\psi \ge \sum_{e,x} P(e)\, w_\rho(x \mid e) \log \frac{w_\rho(x \mid e)}{q(x)}$). The encoding
rule is the following:

Encode $e^n$ as $u_m$ if $u_m$ is the first code word in U
such that $\delta_1(e^n,u_m) \le d_0(e^n)$; if this never occurs, encode $e^n$ as
$u_M$, or, which is the same, do not encode it at all. We understand
how this decreases the quantity of information to be communicated:
it is now $\le \psi$ nats, on the average, per event. From (4.1.1):

\[ E_{e^n}\{\delta_1(e^n,\psi_1(e^n)_U)\} \le \sum_{e^n} P(e^n)\, d_0(e^n) + d_{\max} \sum_{e^n} P(e^n)\, \Phi_U(e^n) \tag{4.2.1} \]

The second term overbounds the processing loss due to not
encoding certain $e^n$'s. The first overbounds the loss due to the
actual encoding of the other $e^n$'s. Note that this term does not
depend on the vocabulary U, but only on $d_0(e^n)$ which, in turn,
depends only on q(x) and $\rho$.



Case of block length one:

There are $M = e^{\psi}$ words in the vocabulary, chosen randomly
with probability q(x). A single message e is mapped on $u_m = x$
only if

\[ \delta_1(e,x) \le d_0(e) = \sum_{x} w_\rho(x \mid e)\, \delta_1(e,x). \]

What is the added loss if $u_m = x$ is decoded as x'?

If $\delta_1(e,x') \le d_0(e)$ there might be a loss, but it has already
been taken into account in the Processing loss bound (source encoding
bound). We need only be concerned about those e's such that
$\delta_1(e,x) \le d_0(e)$ and $\delta_1(e,x') > d_0(e)$. Clearly

\[ \{e : \delta_1(e,x) \le d_0(e) \text{ and } \delta_1(e,x') > d_0(e)\} \subset \{e : \delta_1(e,x) < \delta_1(e,x')\}. \]

Let $\chi_{xx'}(e)$ be the indicator function of
$\{e : \delta_1(e,x) < \delta_1(e,x')\}$. Then the loss $\ell_2(x,x')$ due to
transmitting and decoding input x into x' satisfies

\[ \ell_2(x,x') \le \sum_{e} P(e)\, \chi_{xx'}(e)\, (\delta_1(e,x') - \delta_1(e,x)) \tag{4.2.2} \]


DEFINITION. The transmission distortion function is

\[ \delta_2(x,x') = \max \Bigl\{ \sum_{e} P(e)\, \chi_{xx'}(e)\, (\delta_1(e,x') - \delta_1(e,x));\ \sum_{e} P(e)\, \chi_{x'x}(e)\, (\delta_1(e,x) - \delta_1(e,x')) \Bigr\}. \tag{4.2.3} \]

This definition is unnatural a priori. The consideration of
$\delta_2$ is justified by:

- $\delta_2(\cdot,\cdot)$ is not a function of the particular code at hand
whereas $\ell_2(\cdot,\cdot)$ is. It is a function of the givens of this problem:
$P(\cdot)$ and $d(\cdot,\cdot)$.

- $\ell_2(x,x') \le \delta_2(x,x')$ $\forall$ x and x'.

- $\ell_2(x^n,x'^n) \le \frac{1}{n} \sum_{k=1}^{n} \delta_2(x_k,x'_k)$. This last property is
crucial in the proof of theorem III. It will be proved below.

Case of block length n:

LEMMA. The loss $\ell_2(x^n,x'^n)$ due to transmitting and decoding
channel input $x^n$ into $x'^n$ is over-approximated by the single
letter loss function $\delta_2(x,x')$:

\[ \ell_2(x^n,x'^n) \le \frac{1}{n} \sum_{k=1}^{n} \delta_2(x_k,x'_k). \]


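Proof.

\[ \ell_2(x^n,x'^n) \le \sum_{e^n \in \mathcal{E}^n} P(e^n)\, [\delta_1(e^n,x'^n) - \delta_1(e^n,x^n)] \]

where:

\[ \mathcal{E}^n \triangleq \{e^n : \delta_1(e^n,x^n) < \delta_1(e^n,x'^n)\}. \]

Now:

\[ \sum_{e^n \in \mathcal{E}^n} P(e^n)\, [\delta_1(e^n,x'^n) - \delta_1(e^n,x^n)] = \sum_{e^n \in \mathcal{E}^n} P(e^n)\, \frac{1}{n} \sum_{k=1}^{n} (\delta_1(e_k,x'_k) - \delta_1(e_k,x_k)). \]

I claim that

\[ \sum_{k=1}^{n} (\delta_1(e_k,x'_k) - \delta_1(e_k,x_k)) \le \sum_{k=1}^{n} (\delta_1(e_k,x'_k) - \delta_1(e_k,x_k))\, \chi_{x_k x'_k}(e_k). \]

Indeed, suppose that for a particular $e_k$, $\chi_{x_k x'_k}(e_k) = 0$. Then
$\delta_1(e_k,x'_k) \le \delta_1(e_k,x_k)$ and therefore, to a non-positive term in the
first member corresponds a zero term in the second. Hence

\[ \ell_2(x^n,x'^n) \le \sum_{e^n} P(e^n)\, \frac{1}{n} \sum_{k=1}^{n} \chi_{x_k x'_k}(e_k)\, (\delta_1(e_k,x'_k) - \delta_1(e_k,x_k)). \]

Now, since $P(e^n) = P(e_1) \cdots P(e_n)$,

\[ \ell_2(x^n,x'^n) \le \frac{1}{n} \sum_{k=1}^{n} \sum_{e_k} P(e_k)\, \chi_{x_k x'_k}(e_k)\, (\delta_1(e_k,x'_k) - \delta_1(e_k,x_k)), \]

\[ \ell_2(x^n,x'^n) \le \frac{1}{n} \sum_{k=1}^{n} \delta_2(x_k,x'_k) \quad \text{by (4.2.3)}. \]

q.e.d.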


REMARK: $\delta_2(x,x')$ is in fact the product of the probability of
x being sent given that x belongs to the vocabulary U and the
loss when the output of the channel is decoded into x'. Therefore,
to compute the expected transmission loss, we need only sum up the
expected losses for each word in the vocabulary.


Properties of $\delta_2(\cdot,\cdot)$:

1) $\delta_2(x,x') \ge 0$, $\delta_2(x,x') = 0$ when x = x'.

2) the $[\delta_2(x,x')]$ matrix is square and symmetric, an
important property for what follows.

3) $\delta_2(\cdot,\cdot)$ has the triangular inequality property.

LEMMA. $\delta_2(x_1,x_3) \le \delta_2(x_1,x_2) + \delta_2(x_2,x_3)$.

Proof. Let us first show that

\[ \sum_{e} P(e)\, \chi_{x_1 x_3}(e)\, (\delta_1(e,x_3) - \delta_1(e,x_1)) \le \sum_{e} P(e)\, \chi_{x_1 x_2}(e)\, (\delta_1(e,x_2) - \delta_1(e,x_1)) + \sum_{e} P(e)\, \chi_{x_2 x_3}(e)\, (\delta_1(e,x_3) - \delta_1(e,x_2)). \tag{4.2.4} \]

The first member can be written

\[ \sum_{e} P(e)\, \chi_{x_1 x_3}(e)\, [\delta_1(e,x_2) - \delta_1(e,x_1) + \delta_1(e,x_3) - \delta_1(e,x_2)]. \]

Suppose $\chi_{x_1 x_3}(e) = 1$, i.e., $\delta_1(e,x_1) < \delta_1(e,x_3)$. Three cases are
possible.

1) $\delta_1(e,x_3) \ge \delta_1(e,x_2) \ge \delta_1(e,x_1)$: the two members of the
inequality (4.2.4) are equal.

2) $\delta_1(e,x_2) > \delta_1(e,x_3) > \delta_1(e,x_1)$: the right-hand side of (4.2.4)
is larger because it has a zero term rather than the negative term
$\delta_1(e,x_3) - \delta_1(e,x_2)$.

3) $\delta_1(e,x_3) > \delta_1(e,x_1) > \delta_1(e,x_2)$: the right-hand side of (4.2.4)
is larger for the same reason as in 2).

If $\chi_{x_1 x_3}(e) = 0$ then the right-hand side of (4.2.4) is larger than or
equal to the left-hand side, because it is non-negative. We would
prove in the same fashion the corresponding inequality for the second
sum in (4.2.3), with the roles of $x_1$ and $x_3$ interchanged, which gives
the lemma.

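The three properties of $\delta_2$ can be checked numerically on a small case. The sketch below follows the definition (4.2.3), using the fact that the indicator times the difference is the positive part of the difference. The function names and the three-event, three-letter loss matrix are hypothetical and chosen only for illustration.

```python
def one_sided_loss(P, delta1, x, x_prime):
    """sum_e P(e) chi_{xx'}(e) (delta1(e,x') - delta1(e,x)).

    The indicator keeps only events with delta1(e,x) < delta1(e,x'),
    so each term is the positive part of the difference.
    """
    return sum(p_e * max(delta1[e][x_prime] - delta1[e][x], 0.0)
               for e, p_e in enumerate(P))


def transmission_distortion(P, delta1):
    """Matrix [delta2(x,x')] as the larger of the two one-sided sums."""
    size = len(delta1[0])
    return [[max(one_sided_loss(P, delta1, x, xp), one_sided_loss(P, delta1, xp, x))
             for xp in range(size)] for x in range(size)]


# Hypothetical source distribution and processing loss matrix delta1[e][x].
P = [0.5, 0.3, 0.2]
delta1 = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [3.0, 1.0, 0.0]]
delta2 = transmission_distortion(P, delta1)
for row in delta2:
    print([round(value, 3) for value in row])
```

The resulting matrix has a zero diagonal, is symmetric by construction, and satisfies the triangular inequality for every triple of letters.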


SECTION  5 

An  Upperbound  to  the  Expected  Transmission  Loss,  Given  a  Channel 

Let us be given a channel (X, p(y|x), Y), where:

$X = \{1,\ldots,H\}$

$Y = \{1,\ldots,J\}$

and a transmission loss matrix $[\delta_2(x,x')]$. We recall that $\delta_2(\cdot,\cdot)$
was induced by $\delta_1(\cdot,\cdot)$ through an arbitrary 1-1 correspondence
between the action set A and X.

We want now to associate channel input and output letters in order
to define a distortion measure between channel input and channel output
letters. Let us call a(y) the action associated with y determined
in the following fashion:

\[ \sum_{x} q(x)\, p(y \mid x)\, \delta_2(x,a(y)) \le \sum_{x} q(x)\, p(y \mid x)\, \delta_2(x,x') \quad \forall\, x', \]

where q(x) is the probability distribution used on the x's.

Define: $\delta_3(x,y) = \delta_2(x,a(y))$.

We will suppose, furthermore, that the correspondence between A
and X has been done so as to minimize

\[ \sum_{x,y} q(x)\, p(y \mid x)\, \delta_2(x,a(y)). \]

The channel is now adapted to the source and the loss function
$d(\cdot,\cdot)$.
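The association a(y) and the induced measure $\delta_3$ can be sketched directly from the defining inequality. This is an illustration only: a(y) is represented by the index of the channel input letter paired with the action (using the 1-1 correspondence between A and X), and the function names and the binary example are hypothetical.

```python
def associated_letter(q, p, delta2, y):
    """Letter a(y) minimizing sum_x q(x) p(y|x) delta2(x, x') over x'."""
    letters = range(len(q))

    def cost(x_prime):
        return sum(q[x] * p[x][y] * delta2[x][x_prime] for x in letters)

    return min(letters, key=cost)


def induced_distortion(q, p, delta2):
    """Return the list a(y) and the matrix delta3[x][y] = delta2(x, a(y))."""
    outputs = range(len(p[0]))
    a = [associated_letter(q, p, delta2, y) for y in outputs]
    delta3 = [[delta2[x][a[y]] for y in outputs] for x in range(len(q))]
    return a, delta3


# Assumed example: binary channel with crossover probability 0.1.
q = [0.5, 0.5]
p = [[0.9, 0.1], [0.1, 0.9]]          # p[x][y] = p(y|x)
delta2 = [[0.0, 1.0], [1.0, 0.0]]
a, delta3 = induced_distortion(q, p, delta2)
print(a, delta3)
```

In this example each output letter is associated with the input letter most likely to have produced it, so $\delta_3$ coincides with $\delta_2$.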

We now propose to ask the following question: Let U be a
vocabulary of M sequences of channel input letters of
length n: $u_m = (x_{m1},\ldots,x_{mn})$, and let the source messages $e^n$ be
mapped on the codewords $u_m$'s by the rule given in 4.1. The loss
when $u_m$ is transmitted and decoded into $u_{m'}$ is over-approximated
by $\delta_2(u_m,u_{m'})$; what is the best decoding function $\psi_2(\cdot) : Y^n \to A^n$
and what is the least obtainable value of the expected transmission
loss?

As a matter of fact, the only optimal decoding function is the
one which maps an output sequence $y^n$ onto the code word $u_m$ that
minimizes the quantity

\[ \sum_{m'=1}^{M} \Pr\{u_{m'} \mid y^n\}\, \delta_2(u_m,u_{m'}). \]

Unfortunately this decoding rule is hardly feasible in practice,
especially for large n's, and moreover one does not know how to
evaluate its performance in terms of a bound.

We will instead use a so-called "minimum distance" decoding
function, defined as follows:

\[ \psi_2(y^n) = u_m \quad \text{if } \delta_3(u_m,y^n) \le \delta_3(u_{m'},y^n) \quad \forall\, m' = 1,\ldots,M. \tag{5.1} \]

When $y^n$ is received, $\psi_2(\cdot)_U$ decodes it as the code word $u_m$
which is the least distorted with respect to $y^n$.
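A minimal sketch of the "minimum distance" rule (5.1) follows. It assumes that the block distortion is the per-letter average of $\delta_3$ (by analogy with the block definition of $\delta_1$), that ties go to the first code word in the vocabulary, and that the names and numbers are hypothetical.

```python
def block_distortion(delta3, x_seq, y_seq):
    """Per-letter average of delta3 between a code word and a received block."""
    return sum(delta3[x][y] for x, y in zip(x_seq, y_seq)) / len(x_seq)


def minimum_distance_decode(y_seq, vocabulary, delta3):
    """Return the code word least distorted with respect to y_seq.

    `min` keeps the first minimizer, so ties go to the earliest code word.
    """
    return min(vocabulary, key=lambda u: block_distortion(delta3, u, y_seq))


# Assumed example: binary letters, distortion 1 when input and output differ.
delta3 = [[0.0, 1.0], [1.0, 0.0]]
vocabulary = [(0, 0, 0), (1, 1, 1)]
received = (0, 1, 0)
decoded = minimum_distance_decode(received, vocabulary, delta3)
print(decoded)
```

The received block differs from (0, 0, 0) in one position and from (1, 1, 1) in two, so the first code word is chosen.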

Given a vocabulary $U = \{u_1,\ldots,u_M\}$, we would like to evaluate
the expected value of the transmission loss function:

\[ \sum_{m=1}^{M} \sum_{y^n} \Pr(y^n \mid u_m)\, \delta_2(u_m,\psi_2(y^n)). \tag{5.2} \]

As in Section 4.1, we will only be able to evaluate the expected
value of this quantity over all U's $= \{u_1,\ldots,u_M\}$ generated at
random as in 4.1. We will prove that if the code rate $\psi$ is small
enough, the expected transmission loss, averaged over all U's, is
bounded from above by a function which tends to zero with $n \to \infty$.


Proof. Let U be generated at random, each word $u_m$ being
chosen independently with probability

\[ \mathrm{Prob}\{u_m = x^n\} = Q(x^n) = \prod_{k=1}^{n} q(x_k), \]

where $x^n = (x_1,\ldots,x_k,\ldots,x_n)$ and $q(\cdot)$ is the same
distribution as in 4.1.

Let $x^n = (x_1 \cdots x_n)$ be a code word and $y^n$ be received when
$x^n$ has been sent. The probability that the distortion $\delta_3(x^n,y^n)$ be
larger than some value is bounded from above using a result of
Fano's, "Transmission of Information", Chapter 8.

\[ \mathrm{Prob}\{\delta_3(x^n,y^n) \ge g'(x^n,s)\} \le e^{-n[s g'(x^n,s) - g(x^n,s)]}, \quad s \ge 0 \tag{5.3} \]

where:

- $g'(x^n,s)$ is an increasing function of s and

\[ g'(x^n,0) = \sum_{y^n} P(y^n \mid x^n)\, \delta_3(x^n,y^n); \]

- \[ s g'(x^n,s) - g(x^n,s) = \frac{1}{n} \sum_{k=1}^{n} \sum_{y} P_s(y \mid x_k) \log \frac{P_s(y \mid x_k)}{p(y \mid x_k)} \ge 0, \]

increasing with s;

- \[ P_s(y \mid x_k) = \frac{p(y \mid x_k)\, e^{s\delta_3(x_k,y)}}{\sum_{y'} p(y' \mid x_k)\, e^{s\delta_3(x_k,y')}}. \]


If $\delta_3(x^n,y^n) \ge g'(x^n,s)$ then the transmission loss is less than or
equal to $\delta_{2\max}$, the maximal entry in the $[\delta_2(x,x')]$ matrix. If
$\delta_3(x^n,y^n) < g'(x^n,s)$ then a transmitting and decoding loss smaller
than or equal to $2g'(x^n,s)$ could occur only if there was another
code word $x'^n$ such that $\delta_3(x'^n,y^n) \le g'(x^n,s)$; otherwise, no loss.
The probability of such a situation depends on $y^n$ and $g'(x^n,s)$
only, since the code words are chosen independently. This probability,
averaged over all $y^n$'s such that $\delta_3(x^n,y^n) < g'(x^n,s)$, is taken to
be equal to the probability of another code word at a distance less
than $g'(x^n,s)$ from $x^n$.

Let us now estimate this probability using once again one of
Fano's bounds:

\[ \mathrm{Prob}\{x'^n : \delta_2(x^n,x'^n) \le g'(x^n,s)\} \le e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]}, \quad \lambda(x^n) \le 0 \tag{5.4} \]

where:

- $g'(x^n,s) = f'(x^n,\lambda(x^n))$, or $\lambda(x^n) = 0$ if $g'(x^n,s) \ge f'(x^n,0)$;

- $f'(x^n,\lambda)$ is an increasing function of $\lambda$,

\[ f'(x^n,\lambda) \le f'(x^n,0) = \sum_{x'^n} Q(x'^n)\, \delta_2(x^n,x'^n), \quad \lambda \le 0; \]

- \[ \lambda f'(x^n,\lambda) - f(x^n,\lambda) = \frac{1}{n} \sum_{k=1}^{n} \sum_{x' \in X} Q_\lambda(x' \mid x_k) \log \frac{Q_\lambda(x' \mid x_k)}{q(x')} \ge 0 \quad \text{for } \lambda \le 0 \]

is a decreasing function of $\lambda$;

- \[ Q_\lambda(x' \mid x_k) = \frac{q(x')\, e^{\lambda\delta_2(x_k,x')}}{\sum_{x'} q(x')\, e^{\lambda\delta_2(x_k,x')}}. \]


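The probability that there exists no other code word in the sphere of
radius $g'(x^n,s)$ about $x^n$:

\[ 1 - \pi \ge \bigl[ 1 - e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]} \bigr]^{M-1} \ge 1 - M e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]}, \quad \lambda(x^n) \le 0; \]

the probability that there exists at least another code word in this sphere,
with $M = e^{n\psi}$:

\[ \pi \le e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n)) - \psi]}; \]

the average loss for U, averaged over all U's:

\[ E_{U,y^n,x^n}\{\delta_2(x^n,\psi_2(y^n))\} \le E_{x^n}\{3g'(x^n,s)\}\, E_{x^n}\bigl\{ e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n)) - \psi]} \bigr\} + \delta_{2\max}\, E_{x^n}\bigl\{ e^{-n[s g'(x^n,s) - g(x^n,s)]} \bigr\}, \quad \lambda(x^n) \le 0,\ s \ge 0 \tag{5.6} \]
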


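Now:

\[ E_{x^n}\{3g'(x^n,s)\} = 3 \sum_{x^n} Q(x^n)\, \frac{1}{n} \sum_{k=1}^{n} \sum_{y} P_s(y \mid x_k)\, \delta_3(x_k,y) = 3 \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_3(x,y) \]

because $Q(x^n) = \prod_{k=1}^{n} q(x_k)$.

\[ E_{x^n}\bigl\{ e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]} \bigr\} = \sum_{\{x^n : \lambda(x^n) < 0\}} Q(x^n)\, e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]} + \sum_{\{x^n : \lambda(x^n) = 0\}} Q(x^n) \tag{5.7} \]

because $\lambda f'(x^n,\lambda) - f(x^n,\lambda) = 0$ for $\lambda = 0$.

\[ \sum_{\{x^n : \lambda(x^n) = 0\}} Q(x^n) = P_Q\{x^n : \lambda(x^n) = 0\} = P_Q\Bigl\{ \sum_{x'^n} Q(x'^n)\, \delta_2(x^n,x'^n) \le \sum_{y^n} P_s(y^n \mid x^n)\, \delta_3(x^n,y^n) \Bigr\} \]

by (5.4). $Q(x'^n) = \prod_{k=1}^{n} q(x'_k)$. Therefore, as $n \to \infty$:

- $\sum_{x'^n} Q(x'^n)\, \delta_2(x^n,x'^n) \to \sum_{x,x'} q(x)\, q(x')\, \delta_2(x,x')$

- $\sum_{y^n} P_s(y^n \mid x^n)\, \delta_3(x^n,y^n) \to \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_3(x,y)$

But $\sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_3(x,y) = \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_2(x,a(y))$ because
of the association between Y and A; by this association:

\[ \sum_{x,y} q(x)\, p(y \mid x)\, \delta_2(x,a(y)) \le \sum_{x,x'} q(x)\, q(x')\, \delta_2(x,x'). \]

Since $\sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_2(x,a(y))$ is an increasing function of s,
we can choose s small enough to ensure

\[ \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_2(x,a(y)) \le \sum_{x,x'} q(x)\, q(x')\, \delta_2(x,x'). \]

Therefore, for s small enough,
$\mathrm{Prob}_Q\bigl\{ \sum_{x'^n} Q(x'^n)\, \delta_2(x^n,x'^n) \le \sum_{y^n} P_s(y^n \mid x^n)\, \delta_3(x^n,y^n) \bigr\}$
tends to zero exponentially as $n \to \infty$.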
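Now, because of the continuity in $\lambda$ of $\lambda f'(x^n,\lambda) - f(x^n,\lambda)$
and because of the continuity of the expectation operation, there
exists a maximal $\lambda_s \le 0$ such that:

\[ E_{x^n}\bigl\{ e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n))]} \bigr\} \le e^{-n E_{x^n}\{\lambda_s f'(x^n,\lambda_s) - f(x^n,\lambda_s)\}}, \]

\[ E_{x^n}\{f'(x^n,\lambda_s)\} = \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x)\, \delta_2(x,x') \ge \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_3(x,y). \]

\[ E_{x^n}\{\lambda f'(x^n,\lambda) - f(x^n,\lambda)\} = E_{x^n}\Bigl\{ \frac{1}{n} \sum_{k=1}^{n} \sum_{x'} Q_\lambda(x' \mid x_k) \log \frac{Q_\lambda(x' \mid x_k)}{q(x')} \Bigr\} = \sum_{x,x'} q(x)\, Q_\lambda(x' \mid x) \log \frac{Q_\lambda(x' \mid x)}{q(x')} \tag{5.8} \]

Finally, in order that $E_{x^n}\bigl\{ e^{-n[\lambda(x^n) f'(x^n,\lambda(x^n)) - f(x^n,\lambda(x^n)) - \psi]} \bigr\}$
tends to zero with $n \to \infty$, it suffices that

\[ \psi < \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x) \log \frac{Q_{\lambda_s}(x' \mid x)}{q(x')} \tag{5.9} \]

In the same fashion

\[ E_{x^n}\{g'(x^n,s)\} = E_{x}\{g'(x,s)\} \]

and

\[ E_{x^n}\bigl\{ e^{-n[s g'(x^n,s) - g(x^n,s)]} \bigr\} = E_{x}\bigl\{ e^{-n[s g'(x,s) - g(x,s)]} \bigr\}. \]

We have proved, so far, that the expected transmission loss, averaged
over all U ($M = e^{n\psi}$), satisfies

\[ E_{U,y^n,x^n}\{\delta_2(x^n,\psi_2(y^n))\} \le 3 E_{x}\{g'(x,s)\}\, e^{-n\bigl[ \sum_{x,x'} q(x) Q_{\lambda_s}(x' \mid x) \log \frac{Q_{\lambda_s}(x' \mid x)}{q(x')} - \psi \bigr]} + \delta_{2\max}\, E_{x}\bigl\{ e^{-n[s g'(x,s) - g(x,s)]} \bigr\} \tag{5.10} \]

with $s \ge 0$ and

\[ \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x)\, \delta_2(x,x') \ge \sum_{x,y} q(x)\, P_s(y \mid x)\, \delta_3(x,y). \]
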


Let us ask now a question of theoretical importance (its economic
relevance cannot be asserted unless the utility function of the
information system and the various cost functions are given):

What is the supremum, $\psi^*$, of the permissible code rates?

A priori $\psi^* \le C$, the capacity of the channel.

Let us prove that $\psi^* \ge C$:

By the convexity $\cap$ of $\log(\cdot)$:

\[ \sum_{\{y : a(y) = x'\}} p(y \mid x) \log (\cdots) \le \sum_{\{y : a(y) = x'\}} p(y \mid x) \log (\cdots) \tag{5.14} \]

From (5.13) and (5.14):

\[ C \le \max_{q(\cdot)} \sum_{x,x'} q(x)\, Q^*(x' \mid x) \log \frac{Q^*(x' \mid x)}{q(x')} \le \psi^* \]

because:

\[ \sum_{x,x'} q(x)\, Q^*(x' \mid x)\, \delta_2(x,x') = \sum_{x,y} q(x)\, p(y \mid x)\, \delta_3(x,y) \]

by definition of a(y) and $Q^*(x' \mid x)$. q.e.d.

Therefore $\psi^* = C$.

Let us summarize the results obtained:

THEOREM III. Let (X, p(y|x), Y), a discrete memoryless channel
of capacity C, and a transmission loss function $\delta_2(\cdot,\cdot) : X \times X \to$
Reals be given. Let $q(\cdot)$ be an arbitrary probability distribution
on X. Let $U = \{u_1,\ldots,u_m,\ldots,u_M\}$, $M = e^{n\psi}$, $u_m = (x_{m1},\ldots,x_{mn})$ be
a vocabulary of code words of length n generated at random, each
word being chosen independently with probability $Q(u_m) = \prod_{k=1}^{n} q(x_{mk})$.
Then the expectation over all U's of the average transmission loss
for a given vocabulary, when a "minimum distance" decoding function
$\psi_2(\cdot) : Y^n \to A^n$ is used, satisfies the inequality established in (5.10),
and the bound tends to zero when s > 0 and $\psi < \psi(\lambda_s)$, where $\psi(\lambda_s)$
denotes the right-hand side of (5.9).

COROLLARY III. There exists a vocabulary $U = \{u_1,\ldots,u_M\}$,
$u_m = (x_{m1},\ldots,x_{mn})$, $M = e^{n\psi}$, whose average loss satisfies
the inequality of Theorem III.


SECTION 6

The Communication Loss Theorems

We recall that, for the sake of analysing the effects of
encoding on the one hand and transmitting and decoding on the other
hand, we introduced, respectively, the Processing Loss Function
$\delta_1(\cdot,\cdot)$ and the Transmission Loss Function $\delta_2(\cdot,\cdot)$ in such a way
that the expectation of the loss due to communication, $\bar d$, be less
than or equal to the sum of the expectation of the loss due to encoding,
$\bar\delta_1$, and the expectation of the loss due to transmitting and decoding,
$\bar\delta_2$:

\[ \bar d \le \bar\delta_1 + \bar\delta_2 \tag{6.1} \]

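In section four, we proved that:

- If a vocabulary $U = \{u_1,\ldots,u_M\}$, $M = e^{n\psi}$, $u_m = (x_{m1},\ldots,x_{mn})$
is generated at random with $\mathrm{Prob}(U) = \prod_{m=1}^{M} \prod_{k=1}^{n} q(x_{mk})$, $q(\cdot)$ being
an arbitrary distribution on the channel alphabet X,

- If the blocks of source messages of length n are mapped
onto U in such a way that $\psi_1(e^n)_U = u_m$ only if $\delta_1(e^n,u_m) \le d_0(e^n)$,

then the average processing loss $\bar\delta_1 = E\{\delta_1(e^n,\psi_1(e^n))\}$
satisfies the bound (4.1.8), referred to here as (6.2). Likewise, the
average transmission loss $\bar\delta_2$ satisfies the bound (5.10), referred to
here as (6.3).

From (6.1), (6.2) and (6.3) it follows that:

THEOREM.

\[ \bar d \le \underbrace{\sum_{e} P(e)\, \gamma'(\rho,e)}_{(1)} + \underbrace{d_{\max} \bigl[ e^{-n} + n^{GH/2} \beta(\rho)\, e^{-n[\sigma\psi - \mu(\sigma)]} \bigr]}_{(2)} + \underbrace{3 \Bigl( \sum_{x} q(x)\, g'(x,s) \Bigr) e^{-n[\psi(\lambda_s) - \psi]}}_{(3)} + \underbrace{\delta_{2\max} \sum_{x} q(x)\, e^{-n[s g'(x,s) - g(x,s)]}}_{(4)} \tag{6.4} \]

\[ \rho \le 0, \quad 0 \le \sigma \le 1, \quad s \ge 0, \]

where (2), (3) and (4) tend to zero as $n \to \infty$ if

\[ \sum_{e,x} P(e)\, w_\rho(x \mid e) \log \frac{w_\rho(x \mid e)}{q(x)} < \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x) \log \frac{Q_{\lambda_s}(x' \mid x)}{q(x')} \tag{6.5} \]

and $q(\cdot)$ is arbitrary.

Note that (6.5) can always be satisfied for $q(\cdot)$ and s given
by taking $\rho \le 0$ large enough, because
$\sum_{e,x} P(e)\, w_\rho(x \mid e) \log \frac{w_\rho(x \mid e)}{q(x)} \to 0$ as $\rho \to 0$.
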

Let us now minimize the bound in (6.4). Two cases need to be
considered:

a) There is no constraint on n.

Since (2), (3), and (4) tend to zero with $n \to \infty$ when
$\rho \le 0$, $s \ge 0$ and $\psi = \mu'(\sigma) < \psi(\lambda_s)$, n can be chosen large enough
to make (2), (3) and (4) negligible with respect to (1). We want,
therefore, to minimize (1) under the constraint that (6.5) is
satisfied, i.e.

\[ \min_{\rho,\, q(\cdot),\, s}\ \sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) \]

subject to

\[ \sum_{e,x} P(e)\, w_\rho(x \mid e) \log \frac{w_\rho(x \mid e)}{q(x)} = \psi = \mu'(\sigma) < \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x) \log \frac{Q_{\lambda_s}(x' \mid x)}{q(x')} \tag{6.6} \]

Now suppose $\rho_0$, $q_0(\cdot)$ and $s_0$ are solutions to (6.6), then

\[ \sum_{x,x'} q_0(x)\, Q_{\lambda_{s_0}}(x' \mid x) \log \frac{Q_{\lambda_{s_0}}(x' \mid x)}{q_0(x')} \le \sup_{q(\cdot),\, s} \sum_{x,x'} q(x)\, Q_{\lambda_s}(x' \mid x) \log \frac{Q_{\lambda_s}(x' \mid x)}{q(x')} = C \]

with equality only if $q_0(\cdot)$ and $s_0$ are solution to 5.11. If that
is so we will say that the source is matched with the channel. If it
is not so,

\[ C - \sum_{x,x'} q_0(x)\, Q_{\lambda_{s_0}}(x' \mid x) \log \frac{Q_{\lambda_{s_0}}(x' \mid x)}{q_0(x')} \]

could be used as a measure of the mismatch between Source and Channel.

In general $\min_{\rho,\, q(\cdot),\, s} \sum_{e,x} P(e)\, w_\rho(x \mid e)\, \delta_1(e,x) \ge R^{-1}(C)$, by definition
of R(d).



PBB3I 8BWKS5WPC3C5 
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THEOREM  IV.  Let  S  be  a  constant,  memoryless  source  which 
generates  messages  e  e  E  with  a  fixed  probability  P(*).  Let 
d(*,  •)  :  E  X  A  -*  Reals  be  the  loss  function  attached  to  S  and  A, 
the  set  of  feasible  actions.  Let  (X,p(y|x),Y)  be  a  discrete, 
memory less  channel  of  capacity  C . 

Let  a  vocabulary  U=  (u^  ••*,uM),  M  =  e11'*',  um  =  (xml>  ’  *  §>xmn) 
be  generated  with  probability  Prob  {  }  =  n^=1  ^mn)*  Let 

the  source  messages  e11  be  mapped  onto  U  using  the  encoding 
function  outputs  yn  of  the  channel  be  decoded  by 

Thcnj 

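$$
\begin{aligned}
E_{U,\,T(\cdot),\,e^n}\Big\{d\big(e^n,\ t_2[T(t_1(e^n))]\big)\Big\} \;\le\;& \sum_{e,x} P(e)\,W_\rho(x|e)\,\delta_1(e,x) && (1)\\
&+\; d_{\max}\Big[e^{-n} + n^2\,\beta(\rho)\,e^{-n[\sigma\psi-\mu(\sigma)]}\Big] && (2)\\
&+\; 3\Big(\sum_x q(x)\,g'(x,s)\Big)\,e^{-n[\psi(\lambda_s)-\psi]} && (3)\\
&+\; \frac{\delta_{2\max}}{2}\sum_x q(x)\,e^{-n[s g'(x,s)-g(x,s)]} && (4)
\end{aligned}
\qquad (6.7)
$$

where $\rho$, $q(\cdot)$, $s$ minimize

$$\sum_{e,x} P(e)\,W_\rho(x|e)\,\delta_1(e,x)$$

under the constraints $\rho < 0$, $s > 0$, $0 < \sigma < 1$ and

$$\sum_{e,x} P(e)\,W_\rho(x|e)\,\log\frac{W_\rho(x|e)}{q(x)} \;=\; \psi \;=\; \mu'(\sigma) \;<\; \sum_{x,x'} q(x)\,Q_{\lambda_s}(x'|x)\,\log\frac{Q_{\lambda_s}(x'|x)}{q(x')} \qquad (6.8)$$

and $\sum_{e,x} P(e)\,W_\rho(x|e)\,\delta_1(e,x) \ge R^{-1}(C)$, with equality only if Source and Channel are matched.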

COROLLARY III. There exists a code whose expected loss satisfies (6.7).

b) The block length n is constrained to be at most equal to N.


THEOREM IV'.

$$
\begin{aligned}
d \;\le\; \min_{\rho,\,q(\cdot),\,s,\,n,\,\psi}\Big\{ & \sum_e P(e)\,\gamma'(\rho,e) \;+\; d_{\max}\Big[e^{-n} + n^2\,\beta(\rho)\,e^{-n[\sigma\psi-\mu(\sigma)]}\Big] \\
& +\; 3\Big(\sum_x q(x)\,g'(x,s)\Big)\,e^{-n[\psi(\lambda_s)-\psi]} \;+\; \frac{\delta_{2\max}}{2}\sum_x q(x)\,e^{-n[s g'(x,s)-g(x,s)]}\Big\}
\end{aligned}
\qquad (6.9)
$$

with $\rho < 0$, $s > 0$, $n \le N$, $\psi = \mu'(\sigma)$, $0 < \sigma < 1$.

The minimization is made easier by the fact that all the functions in the bound are well behaved, either increasing or decreasing in the various parameters.

SECTION 7

Treatment of the General Problem with Certain Properties of
the Source and the Channel Assumed


In this section, we deal with a more restrictive set of assumptions, namely:

- The source S is binary, memoryless and uniform, i.e., $E = \{1,2\}$, $\{e_k\}_{k=1}^{\infty}$ is a sequence of independent, identically distributed random variables and $\Pr\{e_k = 1\} = \Pr\{e_k = 2\} = \tfrac12$.

- $A = \{1,2\}$.

- The loss function is symmetric in the following sense:

DEFINITION. A loss function is symmetric if each row of the loss matrix $[d(e,a)]$ contains the same set of numbers $d_1, \dots, d_H$ and each column of $[d(e,a)]$ contains the same set of numbers.

We will take $d_{\max} = 1$ without loss of generality, so that $E_{e,a}\{d(e,a)\}$ = Prob (error) per message, since $d(e,a) = 1 - \delta_{ea}$, where $\delta$ is the Kronecker symbol. The loss matrix is then

    [d(e,a)]    a = 1   a = 2
    e = 1         0       1
    e = 2         1       0


- The channel is binary, symmetric and memoryless.

DEFINITION. A channel is symmetric if each row of the channel matrix $[p(y|x)]$ contains the same set of numbers $p'_1, \dots, p'_J$ and each column of $[p(y|x)]$ contains the same set of numbers $q'_1, \dots, q'_J$.

    [p(y|x)]    y = 1   y = 2
    x = 1       1 - p     p
    x = 2         p     1 - p        (w.l.o.g.)


-  The  utility  function  is  linear  in  all  the  criteria,  to  be 
given  later. 

These hypotheses have been selected in such a way that:

-  the  computations  be  feasible  by  hand, 

-  the  number  of  parameters  be  reduced  to  a  minimum, 

-  the  results  be  simple  enough  to  allow  a  direct  reading  of  the 
effect  of  each  parameter. 

This  will  enable  us  to  discuss,  for  this  case,  the  general 
problem,  stated  in  section  1,  of  the  choice  of  an  optimal  Information 
System  (i.e.,  Encoding,  Channel,  Decoding),  given  a  Source  of  messages 
and  the  user's  utility  function. 

It  is  important  to  note  that  "binary"  can  easily  be  dropped  if 
all  the  other  assumptions  are  kept. 
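The assumptions of this section can be checked numerically. The following is a minimal sketch, not part of the original text: the function names (`is_symmetric`, `bsc`, `expected_loss`, `bsc_capacity`, `uncoded_joint`) are hypothetical, indices are 0-based instead of {1, 2}, and the capacity formula log 2 - H(p) in nats is the standard one for a binary symmetric channel, assumed here to coincide with the C used in the text.

```python
import math

# Loss matrix d(e, a) = 1 - delta_ea for E = A = {1, 2} (0-based indices here).
LOSS = [[0.0, 1.0],
        [1.0, 0.0]]


def bsc(p):
    """Channel matrix [p(y|x)] of a binary symmetric channel with crossover p."""
    return [[1.0 - p, p],
            [p, 1.0 - p]]


def is_symmetric(matrix):
    """Check the DEFINITION: every row holds one set of numbers, every column another."""
    rows = [sorted(row) for row in matrix]
    cols = [sorted(col) for col in zip(*matrix)]
    return all(r == rows[0] for r in rows) and all(c == cols[0] for c in cols)


def expected_loss(joint):
    """E{d(e, a)} for a joint distribution joint[e][a] = Pr{e, a}."""
    return sum(joint[e][a] * LOSS[e][a] for e in range(2) for a in range(2))


def bsc_capacity(p):
    """Capacity log 2 - H(p) of the binary symmetric channel, in nats (assumed form)."""
    if p in (0.0, 1.0):
        return math.log(2.0)
    return math.log(2.0) + p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def uncoded_joint(p):
    """Joint law of (e, a) when the uniform source is sent uncoded and a = y."""
    channel = bsc(p)
    return [[0.5 * channel[e][a] for a in range(2)] for e in range(2)]
```

With the loss 1 - delta_ea, `expected_loss(uncoded_joint(p))` is simply the probability that the action differs from the event, which for uncoded transmission equals the crossover probability p.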

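Let us compute the right-hand side of (6.9):

$$
\begin{aligned}
\min_{\rho,\,q(\cdot),\,s,\,n,\,\psi}\Big\{ & \underbrace{\sum_e P(e)\,\gamma'(\rho,e)}_{(1)} \;+\; \underbrace{d_{\max}\Big[e^{-n} + n^2\,\beta(\rho)\,e^{-n[\sigma\psi-\mu(\sigma)]}\Big]}_{(2)} \\
& +\; \underbrace{3\Big(\sum_x q(x)\,g'(x,s)\Big)\,e^{-n[\psi(\lambda_s)-\psi]}}_{(3)} \;+\; \underbrace{\frac{\delta_{2\max}}{2}\sum_x q(x)\,e^{-n[s g'(x,s)-g(x,s)]}}_{(4)}\Big\}
\end{aligned}
$$

with $\rho < 0$, $s > 0$, $\psi = \mu'(\sigma)$, $0 < \sigma < 1$.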

In this perfectly symmetric situation, the optimal association between X, Y and A is, a priori:

    X       Y       A

    1  <->  1  <->  1

    2  <->  2  <->  2

Likewise the optimal $q(\cdot)$, $\forall\ \rho, s, n, \psi$, is the uniform distribution $q(x) = \tfrac12\ \ \forall x$.


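A. Computation of (1) and (2)

    [δ_1(e,x)]   x = 1   x = 2
    e = 1          0       1
    e = 2          1       0

$$\gamma(\rho,e) = \log\sum_x q(x)\,e^{\rho\,\delta_1(e,x)} = \log\Big(\tfrac12\,e^{\rho\times 0} + \tfrac12\,e^{\rho\times 1}\Big)$$

$$\gamma(\rho,e) = \log\tfrac12\,(1+e^{\rho}) \qquad \forall e \qquad (7.1)$$

$$\gamma'(\rho,e) = \frac{e^{\rho}}{1+e^{\rho}} \qquad \forall e \qquad (7.2)$$

$$\mu(\sigma) = \log\sum_e P(e)\,e^{-\sigma\,(\gamma(\rho,e)-\rho\,\gamma'(\rho,e))}$$

$$\mu(\sigma) = -\sigma\Big[\log\tfrac12\,(1+e^{\rho}) - \frac{\rho\,e^{\rho}}{1+e^{\rho}}\Big] \qquad (7.3)$$

$$W_\rho(x|e) = e^{\rho\,\delta_1(e,x)-\gamma(\rho,e)}\,q(x)$$

$$[W_\rho(x|e)] = \begin{pmatrix} \dfrac{1}{1+e^{\rho}} & \dfrac{e^{\rho}}{1+e^{\rho}} \\[2ex] \dfrac{e^{\rho}}{1+e^{\rho}} & \dfrac{1}{1+e^{\rho}} \end{pmatrix} \qquad (7.4)$$

(rows $e = 1, 2$; columns $x = 1, 2$).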
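The closed forms for $\gamma$, $\gamma'$ and $W_\rho$ in the binary case can be checked against their defining expressions. The following is a minimal sketch under the section's assumptions (uniform $q$, uniform $P$, $\delta_1(e,x) = 1 - \delta_{ex}$); the function names are hypothetical and indices are 0-based.

```python
import math


def gamma(rho):
    """gamma(rho, e) = log(1/2 (1 + e^rho)); the same value for both e."""
    return math.log(0.5 * (1.0 + math.exp(rho)))


def gamma_prime(rho):
    """Derivative of gamma with respect to rho: e^rho / (1 + e^rho)."""
    return math.exp(rho) / (1.0 + math.exp(rho))


def tilted_w(rho):
    """Matrix [W_rho(x|e)] = q(x) exp(rho * delta_1(e, x) - gamma(rho)), q uniform."""
    q = 0.5
    g = gamma(rho)
    return [[q * math.exp(rho * (0.0 if x == e else 1.0) - g) for x in range(2)]
            for e in range(2)]


def mean_distortion(rho):
    """Term (1): sum over e, x of P(e) W_rho(x|e) delta_1(e, x), with P uniform."""
    w = tilted_w(rho)
    return sum(0.5 * w[e][x] for e in range(2) for x in range(2) if x != e)
```

For any $\rho < 0$ each row of `tilted_w(rho)` sums to one, and `mean_distortion(rho)` agrees with `gamma_prime(rho)`, which is why term (1) can be written either with $W_\rho\,\delta_1$ or with $\gamma'$.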


Ps2 


B(p,e)  =  |P|  +  j  +)  W-(7T?)  ■  lp I  +  3  +  (  p""“  V 

^  • - 1  A  '  1  '  p 


P(p)  =  2tt  e 


(7.5) 


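From (7.1), (7.2), (7.3), (7.5) and (6.9),

$$(1) + (2) = \frac{e^{\rho}}{1+e^{\rho}} + \Big(e^{-n} + n^2\,\beta(\rho)\,e^{-n\left[\sigma\psi + \sigma\left[\log\frac12(1+e^{\rho}) - \frac{\rho e^{\rho}}{1+e^{\rho}}\right]\right]}\Big) \qquad (7.6)$$

with $\beta(\rho)$ as given in (7.5).

By definition

$$\lambda_1 = \frac{e^{\rho}}{1+e^{\rho}} \qquad (7.7)$$

$\sigma = 1$ minimizes (2).

From (4.1.9) and (4.1.10),

$$\log\tfrac12\,(1+e^{\rho}) - \frac{\rho\,e^{\rho}}{1+e^{\rho}} = -R(\lambda_1) \qquad (7.8)$$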
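The identity (7.8) can be verified directly from (7.7). The derivation below is a sketch that assumes $R(\lambda) = \log 2 - H(\lambda)$ (natural logarithms, $H$ the binary entropy), which is the standard rate-distortion function of a uniform binary source under the error-probability loss; equations (4.1.9) and (4.1.10) lie outside this section, so their exact form is an assumption here.

```latex
% From (7.7): 1 - \lambda_1 = 1/(1+e^{\rho}) and \rho = \log\frac{\lambda_1}{1-\lambda_1}.
\begin{align*}
\log\tfrac12(1+e^{\rho}) - \frac{\rho\,e^{\rho}}{1+e^{\rho}}
  &= -\log 2 - \log(1-\lambda_1) - \lambda_1 \log\frac{\lambda_1}{1-\lambda_1} \\
  &= -\log 2 - \lambda_1\log\lambda_1 - (1-\lambda_1)\log(1-\lambda_1) \\
  &= -\bigl[\log 2 - H(\lambda_1)\bigr] \\
  &= -R(\lambda_1).
\end{align*}
```

Since $\rho < 0$ corresponds to $\lambda_1 < \tfrac12$, the right-hand side is strictly negative on the range of interest.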


We will confine ourselves to $0 < \lambda_1 < \tfrac12$ without loss of generality; from (7.7) and (7.8),

$$(1) + (2) = \lambda_1 + \Big\{e^{-n} + n^2\,\beta(\rho)\,e^{-n[\psi - R(\lambda_1)]}\Big\} \qquad (7.9)$$


B.  Computation  of  (3)  and  (4) 


sg'(x,s)  -  g(x,s)  =  log 


p  +  pe 


1  -  p  +  pe 


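$$f'(x,\lambda) = \sum_{x'} Q_\lambda(x'|x)\,\delta_2(x,x') = \frac{e^{\lambda}}{1+e^{\lambda}} \qquad \forall x$$

$$\lambda f'(x,\lambda) - f(x,\lambda) = \sum_{x'} Q_\lambda(x'|x)\,\log\frac{Q_\lambda(x'|x)}{q(x')} = \log\frac{2}{1+e^{\lambda}} + \frac{\lambda\,e^{\lambda}}{1+e^{\lambda}} \qquad \forall x$$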


is  the  maximum  A  such  that 


from (7.14) and (7.17), and that relation is true if and only if

$$[Q_{\lambda_s}(x'|x)] = [P_s(y|x)] \qquad (7.21)$$

From (7.15) and (7.18), $\psi(\lambda_s) = s g'(x,s) - g(x,s)$, and (3) + (4) can be written

$$(3) + (4) = \frac{3\,e^{\lambda_s}}{1+e^{\lambda_s}}\,e^{-n[\psi(\lambda_s)-\psi]} + \tfrac12\,e^{-n\psi(\lambda_s)} \qquad (7.22)$$


From (7.9) and (7.22) we get

$$d \;\le\; \lambda_1 + e^{-n} + n^2\,\beta(\rho)\,e^{-n[\psi - R(\lambda_1)]} + \frac{3\,e^{\lambda_s}}{1+e^{\lambda_s}}\,e^{-n[\psi(\lambda_s)-\psi]} + \tfrac12\,e^{-n\psi(\lambda_s)} \qquad (7.23)$$

with $0 < \lambda_1 < \tfrac12$, $s > 0$, $\psi(\lambda_s) = s g'(x,s) - g(x,s)$.

Let us denote by $F(\lambda_1, n, \psi, s)$ the upper bound in (7.23). It is easily checked that $F(\cdot)$ is convex ∪ in all the parameters.

Now, $\lambda_1$, $n$, $\psi$ are parameters of encoding and decoding, whereas $s$ is bound to the channel. It is intuitively clear that:

- The encoding gets more difficult as $n$ increases and $\lambda_1$ decreases.

- The decoding gets more difficult as $n$ and $\psi$ increase.

- The channel gets better as the maximum value of $s$ increases.

In this context and short of being able to do better, one might decide to choose an Information System in terms of $\lambda_1, n, \psi, s$ and $F(\lambda_1, n, \psi, s)$. One might further decide to attach constant cost coefficients to these parameters and try to maximize a utility function of the form:

$$U(\lambda_1, n, \psi, s, F(\lambda_1, n, \psi, s)) = -k_0 n - k_1 \psi - k_2 s - k_3 F(\lambda_1, n, \psi, s). \qquad (7.24)$$

$U(\cdot)$ would have a non-boundary maximum since it is convex ∩ in $\lambda_1$, $n$, $\psi$, and $s$.


Now, assume that the user is not interested in very small n's, i.e., he allows n's large enough that minimizing (3) + (4) in $s$ amounts to choosing $s$ so as to minimize the exponential terms; that is, he would choose $s$ so as to maximize $\psi(\lambda_s)$. But we have proved in Section 5 that $\max_{q(\cdot),\,s} \psi(\lambda_s) = C$. We have already maximized $\psi(\lambda_s)$ with respect to $q(\cdot)$. Therefore, for that $s$, denoted $s_m$, that maximizes $\psi(\lambda_s)$,

$$d \;\le\; \lambda_1 + e^{-n} + n^2\,\beta(\rho)\,e^{-n[\psi - R(\lambda_1)]} + \frac{3\,e^{\lambda_{s_m}}}{1+e^{\lambda_{s_m}}}\,e^{-n(C-\psi)} + \tfrac12\,e^{-nC} \;=\; F^*(\lambda_1, n, \psi, C)$$

with $0 < \lambda_1 < \tfrac12$


m 


m 

2 


pe 


1  +  e 


m 


1  -  p  +  pe  2 


$\psi(\lambda_{s_m}) = C$.


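And he would choose $\lambda_1, n, \psi, C$ so as to maximize

$$U(\lambda_1, n, \psi, C, F^*(\lambda_1, n, \psi, C)) = -k_0 n - k_1 \psi - k_2 C - k_3 F^*(\lambda_1, n, \psi, C).$$

$F^*(\lambda_1, n, \psi, C)$ gives very explicitly the following asymptotic information:

If $R(\lambda_1) < \psi < C$, there exist vocabularies $U = (u_1, \dots, u_M)$, $M = e^{n\psi}$, such that the average loss due to communication with $U$ will be less than $\lambda_1 + \varepsilon(n)$, where $\varepsilon(n) \to 0$ as $n \to \infty$.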
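The condition $R(\lambda_1) < \psi < C$ can be explored numerically for a given channel. The sketch below is illustrative only: it assumes $R(\lambda) = \log 2 - H(\lambda)$ and $C = \log 2 - H(p)$ in nats (the standard forms for a uniform binary source with error-probability loss and a binary symmetric channel with crossover $p$), and the function names are hypothetical.

```python
import math

LOG2 = math.log(2.0)


def binary_entropy(x):
    """H(x) in nats, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def rate_distortion(lam):
    """Assumed R(lambda) = log 2 - H(lambda) for lambda < 1/2, and 0 beyond."""
    if lam >= 0.5:
        return 0.0
    return LOG2 - binary_entropy(lam)


def capacity(p):
    """Assumed capacity C = log 2 - H(p) of the binary symmetric channel."""
    return LOG2 - binary_entropy(p)


def admissible_psi(lam1, p):
    """Open interval (R(lam1), C) of rates psi allowed by the statement, or None."""
    r, c = rate_distortion(lam1), capacity(p)
    return (r, c) if r < c else None
```

Under these assumptions and for $p < \tfrac12$, the interval is non-empty exactly when $\lambda_1 > p$, since $H$ is increasing on $[0, \tfrac12]$. For example, `admissible_psi(0.2, 0.1)` returns an interval while `admissible_psi(0.05, 0.1)` returns `None`, so the target average loss $\lambda_1$ cannot be pushed below the crossover probability of the channel.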


where

of choosing encoding and decoding, given a source of events, a pay-off function
and a channel, as a whole. The bounds we obtained should, therefore, be better,
at least in cases where the pay-off function has a wide range of values.

We  did,  however,  treat  the  non-restricted  problem  with  certain  properties  of  the 
source,  the  channel  and  the  utility  function  assumed. 


